AWS Lesson 26 of 123

RDS Proxy in Production: Connection Pooling, Failover Acceleration, and IAM Authentication

A relational database has a hard ceiling on concurrent connections, and that ceiling is far lower than the concurrency your serverless and container fleets can generate. A db.r6g.large PostgreSQL instance defaults to roughly 850 max_connections; a single Lambda burst can ask for thousands of fresh TCP sessions in seconds, each one a forked backend process with real memory cost. The database does not gracefully shed that load. It runs out of connection slots, new sessions get FATAL: remaining connection slots are reserved, and your healthy application starts throwing 500s because it cannot open a socket — not because the query was slow.

RDS Proxy sits between the application and the database, maintains a warm pool of database connections, and multiplexes a large number of short-lived client connections onto a small number of long-lived database connections. It also holds client sockets open during a failover and routes them to the new writer, which shrinks application-observed failover time dramatically. This article is the production build: provisioning, the connection pinning trap that silently destroys multiplexing, tuning the pool under load, IAM authentication, Lambda and VPC wiring, and the CloudWatch signals that tell you whether the proxy is actually helping.

By the end you will stop treating “the database ran out of connections” as a sizing problem to be solved by a bigger instance. You will know how to put a managed pooler in the path, how to keep multiplexing actually multiplexing, how to make a failover invisible to in-flight transactions, and how to delete the last long-lived database password from your application entirely. The metric that predicts your next outage on this stack is not CPU — it is DatabaseConnectionsCurrentlySessionPinned, and knowing why is the whole point.

What problem this solves

The pain is specific and it shows up under load, never in a unit test. Your application scales horizontally — Lambda fans out, ECS tasks multiply, an autoscaling group adds instances — and each new compute unit opens its own connection pool to the database. The database has no idea these pools are mostly idle. It allocates a backend process per connection (Postgres) or a thread (MySQL), each one holding memory and a process-table slot, and when the count crosses max_connections it rejects everything new with FATAL: remaining connection slots are reserved for non-replication superuser connections. A perfectly healthy database, with CPU at 20%, starts refusing your traffic.

What breaks without a pooler: engineers raise max_connections (which raises memory pressure and eventually the instance OOMs), or scale the database vertically (expensive, and the next traffic doubling brings it back), or — the worst outcome — cap application concurrency to protect the database, throttling the very elasticity they paid for. The real fix is to stop demanding one backend per client. A connection pooler lets a thousand clients share fifty backends because at any instant only the clients mid-transaction need one.

Who hits this hardest: Lambda-backed APIs (each cold-start environment opens a connection, and a scale-out is a connection storm), Fargate/ECS microservices with per-task pools, anything with spiky traffic (flash sales, batch windows, cron fan-outs), and teams running Aurora Serverless v2 or Multi-AZ who also want failover to be a one-second stall instead of a fleet-wide reconnect flood. RDS Proxy answers all three needs — pooling, failover acceleration, and modern IAM/Secrets-Manager authentication — in one managed, VPC-native service. The catch, and the reason this is an advanced topic, is that the pooling silently degrades to passthrough the moment a client touches session state, and you pay for the proxy while getting none of its benefit. Most of this article is about not letting that happen.

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand RDS/Aurora fundamentals: that a DB instance class (db.r6g.large, db.r6g.2xlarge) sets memory and therefore the default max_connections; that Aurora separates a cluster into a writer and zero-or-more readers with a cluster (writer) and reader endpoint; and that a Multi-AZ deployment fails over by promoting a standby/reader and flipping DNS. You should be comfortable with aws CLI, reading JSON output, IAM roles and policies, Secrets Manager, basic VPC concepts (subnets, security groups), and either PostgreSQL or MySQL client tooling. Familiarity with Lambda’s execution-environment lifecycle (init vs handler scope) helps for the integration section.

This sits in the Databases track and is downstream of the core RDS/Aurora knowledge in RDS and Aurora Deep Dive: Engines, Multi-AZ, Replicas, Backups. It pairs tightly with Aurora High Availability and Global Database for Zero-Downtime, because the proxy is the front door that makes that HA design felt by the application, and with Lambda Performance: Cold Starts, Provisioned Concurrency and SnapStart, since Lambda concurrency is the load that creates the connection storm in the first place. The IAM-auth section assumes the policy-evaluation model from IAM Fundamentals: Users, Roles, Policies and Evaluation, and the secret handling builds on Secrets Manager and Parameter Store Deep Dive.

A quick map of who owns each layer, so you escalate to the right team during an incident:

Layer What lives here Who usually owns it Failure classes it can cause
Application / driver Connection lifecycle, prepared statements, SET usage App / dev team Pinning, per-request connect storms
IAM / Secrets Manager rds-db:connect, proxy role, backend secret Security / platform AUTH_FAILURE, token rejection, secret rotation breakage
RDS Proxy Pool config, endpoints, TLS, target group Platform / DBRE Borrow-timeout saturation, pinning amplification
Aurora / RDS cluster max_connections, writer/reader, failover DBRE / platform Slot exhaustion (if bypassed), replica lag
VPC / security groups Subnets, SG chain on 5432/3306 Network team Connectivity timeouts, blocked path
CloudWatch Proxy + DB metrics, alarms SRE / platform Blind spots if unmonitored

Core concepts

Five mental models make every later decision obvious.

The proxy is a multiplexer, not a passthrough. Without a proxy, every application connection maps 1:1 to a database backend for the life of the connection — even while idle. With transaction-level multiplexing, a database connection is borrowed from the pool when a client begins a transaction and returned when it commits or rolls back. Between transactions the client holds no backend. So a thousand clients that each spend 2% of their time inside a transaction can be served by roughly fifty backends. This is the same idea as PgBouncer in transaction mode, but managed, autoscaling, VPC-native, and integrated with IAM and Secrets Manager.

Pinning is multiplexing’s off switch, and it is correctness, not a bug. A backend connection can only be returned to the pool and handed to a different client if it carries no leftover session state. The moment a client does something that mutates session state the next client would inherit — SET search_path, a temp table, a session-level advisory lock, a session-scoped prepared statement — the proxy pins that backend to that client until the client disconnects, refusing to leak state across tenants. Enough pinning and your “pooled” proxy is a 1:1 passthrough you are paying extra for.

Two authentication layers are independent. The proxy authenticates to the database using credentials from Secrets Manager (always — never inline). Clients authenticate to the proxy using either a database password or, better, an IAM token minted via SigV4 that expires in 15 minutes. iam_auth = REQUIRED controls the client→proxy layer; secret_arn controls the proxy→database layer. Conflating these two is the single most common provisioning mistake.

Failover is a held connection, not a reconnect storm. During an Aurora/Multi-AZ failover the writer’s DNS flips to a promoted instance. Applications connecting directly must notice the broken socket, re-resolve DNS (subject to TTL and resolver caches), and reconnect — and a fleet doing this at once is a thundering herd that can knock over the freshly promoted writer. The proxy instead holds the client socket open, absorbs the reconnect internally, and re-routes the held connection to the new writer. The application sees a brief stall on in-flight work, not a flood of connection refused.

The pool is finite and shared, so headroom is math. max_connections_percent caps the proxy’s backends as a percentage of the target’s max_connections. At 75% of an 850-slot instance the proxy can open ~637 backends; the remaining slots are for admin tools, replication, and any direct connections. If you front one database with multiple proxies or also allow direct app traffic, their percentages must sum to under 100 with margin, or something gets remaining connection slots are reserved.

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup.

Concept One-line definition Where it lives Why it matters
RDS Proxy Managed connection pooler in your VPC In front of RDS/Aurora The whole subject; pools + failover + auth
Multiplexing Many clients share few backends per transaction Inside the proxy The benefit you are buying
Backend connection A real database connection in the pool Proxy ↔ database The scarce resource being shared
Client connection A connection from app/Lambda to the proxy App ↔ proxy Cheap; can vastly outnumber backends
Pinning Dedicating a backend 1:1 to one client Inside the proxy Multiplexing’s off switch; the #1 trap
Target group Pool config + the target it fronts On the proxy Where you tune max_connections_percent
Proxy endpoint A DNS name clients connect to On the proxy Default (RW) and extra RO endpoints
iam_auth Whether clients use IAM tokens to the proxy Auth config REQUIRED deletes the app password
secret_arn Secrets Manager creds the proxy uses to the DB Auth config How the proxy logs into the database
require_tls Force TLS on client connections Proxy config Pairs with sslmode=verify-full
rds_iam DB role that authenticates only via IAM Database (Postgres role) Grants a user IAM-only login
Borrow timeout How long a client waits for a free backend Pool config Backpressure valve under saturation

1. The connection-exhaustion problem and what multiplexing buys you

The math is unforgiving. Postgres allocates a backend process per connection; even idle, each one consumes work_mem-adjacent memory and a proc slot. MySQL allocates a thread per connection with its own per-thread buffers. The relationship between application concurrency and database connections is the whole game:

The proxy does this through transaction-level multiplexing: a database connection is borrowed when a client begins a transaction and returned when the transaction commits or rolls back. Between transactions the client holds no backend.

Default connection ceilings scale with instance memory, so the slot you are protecting is small relative to serverless fan-out. Approximate PostgreSQL defaults (the formula is memory-derived; treat these as ballpark, not contractual):

Instance class vCPU / RAM Approx default max_connections One Lambda burst can demand Verdict without a proxy
db.t4g.medium 2 / 4 GB ~410 1000s Exhausts in seconds under fan-out
db.r6g.large 2 / 16 GB ~850 1000s Exhausts under a moderate burst
db.r6g.xlarge 4 / 32 GB ~1700 1000s Survives small bursts, not flash sales
db.r6g.2xlarge 8 / 64 GB ~3400 1000s Large headroom, but still finite + costly
db.r6g.4xlarge 16 / 128 GB ~5000 (capped) 1000s Memory-bound; raising it risks OOM

The multiplexing ratio you can expect depends almost entirely on how long clients hold a transaction — the lower the in-transaction duty cycle, the more clients one backend serves:

Workload shape Avg txn duration In-transaction duty cycle Clients per backend (approx) Notes
Short OLTP point reads 2–5 ms ~1% 50–100:1 Ideal for the proxy
Mixed CRUD API 10–30 ms ~3–5% 20–40:1 Typical web API
Reporting / aggregates 200 ms–2 s 20–60% 2–5:1 Pooling helps little; isolate these
Long transactions / locks seconds+ ~100% ~1:1 Effectively pinned; anti-pattern
Batch import (COPY) minutes ~100% ~1:1 Run direct, not through the proxy

The caveat that defines everything downstream: multiplexing only works when the database connection is safe to hand to a different client after each transaction. Section 3 is about not tripping the pinning that turns the table above into a uniform 1:1.

The CloudWatch metrics that tell you whether the proxy is helping — keep this open while tuning, because the ratio between them is the proof of multiplexing:

Metric What it measures Healthy reading Unhealthy reading What it tells you
ClientConnections Open client→proxy connections High (your fan-out) The demand side
DatabaseConnectionsCurrentlyBorrowed Backends in active use now Far below ClientConnections ClientConnections Multiplexing ratio
DatabaseConnectionsCurrentlySessionPinned Backends pinned 1:1 Near zero ClientConnections Pinning has degraded the pool
DatabaseConnectionsBorrowLatency Wait to borrow a backend Single-digit ms Hundreds of ms / seconds Pool saturation / pinning
DatabaseConnections Total backends open max_connections_percent cap At the cap Headroom remaining
QueryDatabaseResponseLatency DB-side query time Your SLO Rising Backend slow vs proxy slow
MaxDatabaseConnectionsAllowed The current backend cap Stable What max_connections_percent resolves to

The hard limits and quotas to design against — a few are adjustable, most are mechanism facts:

Limit / quota Value Adjustable? Consequence of hitting it
IAM auth token lifetime 15 minutes No Mint a fresh token per connection
connection_borrow_timeout range 0–3600 s Yes Clients error past this when pool is full
idle_client_timeout range 1–28800 s Yes Idle clients reaped at this value
max_connections_percent 1–100% of target Yes >100% summed → DB slot exhaustion
Proxies per region Service quota (raisable) Yes (support) Can’t create more proxies
Subnets required ≥2 in ≥2 AZs No Create fails without AZ spread
Engines supported PostgreSQL, MySQL, SQL Server No Other engines unsupported
Secret keys required username, password (+host/port/engine) No Missing keys → AUTH_FAILURE

RDS Proxy supports MySQL and PostgreSQL on both Aurora and RDS. It is a managed, autoscaling fleet inside your VPC; you pay per vCPU-hour of the underlying database instance class it fronts, which is why the cost trade-off in the Cost & sizing section is real and not an afterthought.

2. Provisioning the proxy: secrets, IAM, and target groups

The proxy needs three things wired correctly: a Secrets Manager secret holding the database credentials it uses to open backend connections, an IAM role granting it read access to that secret, and a target pointing at your cluster or instance.

First, the secret. RDS Proxy authenticates to the database with credentials from Secrets Manager — never inline. The secret JSON must use these keys:

{
  "username": "app_proxy_user",
  "password": "REDACTED",
  "host": "prod-aurora.cluster-abc123.us-east-1.rds.amazonaws.com",
  "port": 5432,
  "engine": "postgres",
  "dbname": "appdb"
}

Store it:

aws secretsmanager create-secret \
  --name prod/aurora/proxy-user \
  --secret-string file://proxy-secret.json

The proxy’s IAM role needs to read that secret and decrypt it with the KMS key:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadProxySecret",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/aurora/proxy-user-*"
    },
    {
      "Sid": "DecryptSecret",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/KMS-KEY-ID",
      "Condition": {
        "StringEquals": { "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com" }
      }
    }
  ]
}

The role’s trust policy must allow rds.amazonaws.com to assume it. Now create the proxy. Terraform is the durable way to express this:

resource "aws_db_proxy" "aurora" {
  name                   = "prod-aurora-proxy"
  engine_family          = "POSTGRESQL"
  role_arn               = aws_iam_role.proxy.arn
  vpc_subnet_ids         = var.private_subnet_ids
  vpc_security_group_ids = [aws_security_group.proxy.id]

  require_tls            = true
  idle_client_timeout    = 1800   # seconds; reap idle clients
  debug_logging          = false  # enable temporarily to capture SQL in pinning logs

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.proxy_user.arn
    iam_auth    = "REQUIRED"   # force IAM token auth from clients
    description = "app proxy user"
  }
}

resource "aws_db_proxy_default_target_group" "aurora" {
  db_proxy_name = aws_db_proxy.aurora.name

  connection_pool_config {
    max_connections_percent      = 75
    max_idle_connections_percent = 50
    connection_borrow_timeout    = 120
  }
}

resource "aws_db_proxy_target" "aurora" {
  db_proxy_name         = aws_db_proxy.aurora.name
  target_group_name     = aws_db_proxy_default_target_group.aurora.name
  db_cluster_identifier = aws_rds_cluster.aurora.id   # Aurora cluster (use db_instance_identifier for RDS)
}

The same shape via CLI, for a one-off or to script around Terraform:

aws rds create-db-proxy \
  --db-proxy-name prod-aurora-proxy \
  --engine-family POSTGRESQL \
  --role-arn arn:aws:iam::111122223333:role/rds-proxy-role \
  --vpc-subnet-ids subnet-aaa subnet-bbb \
  --vpc-security-group-ids sg-proxy \
  --require-tls \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/aurora/proxy-user","IAMAuth":"REQUIRED"}]'

Every top-level proxy setting, its values, default, and the trade-off — set these deliberately:

Setting Values Default When to change Trade-off / gotcha
engine_family POSTGRESQL / MYSQL / SQLSERVER none (required) Match the engine Cannot change after create; recreate to switch
require_tls true / false false Always true in prod Clients must support TLS; pair with verify-full
idle_client_timeout 1–28800 s 1800 s Lower to reap idle Lambda clients Too low drops legit idle keep-alives
debug_logging true / false false Temporarily, to read pinning reasons Logs SQL text → cost + sensitive data; turn off
auth.auth_scheme SECRETS SECRETS n/a Backend creds always come from Secrets Manager
auth.iam_auth DISABLED / REQUIRED DISABLED REQUIRED to delete app passwords REQUIRED forces SigV4 tokens from every client
auth.secret_arn a secret ARN none Per backend user Role must GetSecretValue + kms:Decrypt
vpc_security_group_ids SG ids none (required) Per environment This SG is the DB’s trust boundary
vpc_subnet_ids ≥2 subnets in ≥2 AZs none (required) Per environment Use private subnets only

Two distinctions that bite people:

Target health tells you whether provisioning actually worked. The states and what each reason means:

TargetHealth.State Reason What it means Fix
AVAILABLE Proxy can reach and authenticate to the target Nothing; route traffic
UNAVAILABLE AUTH_FAILURE Secret creds wrong, or role can’t read the secret Fix secret value; grant GetSecretValue/kms:Decrypt
UNAVAILABLE CONNECTION_FAILURE SG/subnet/routing blocks proxy → DB on the port Open DB SG to proxy SG on 5432/3306
UNAVAILABLE PENDING_PROXY_CAPACITY Proxy still provisioning capacity Wait; transient after create
UNAVAILABLE INVALID_REPLICATION_STATE Reader endpoint with no eligible reader Add a reader to the Aurora cluster
REGISTERING Target just added, health-checking Wait a minute

3. Connection pinning: the silent killer of multiplexing

Pinning is the single most important RDS Proxy concept to internalize. When a client does something that makes a backend connection unsafe to share, the proxy stops multiplexing that connection and dedicates it to the client until the client disconnects. Enough pinning and your “pooled” proxy degrades into a 1:1 passthrough — you pay for the proxy and get none of the multiplexing.

For PostgreSQL, common pinning triggers include:

The pinning trigger reference

This is the table to keep open while you audit an ORM. Each trigger, why it pins, and the non-pinning alternative:

Trigger Engine Why it pins Confirm it Non-pinning alternative
SET search_path PG Session state next client inherits Pinning log “SET” reason ALTER ROLE u SET search_path=...
SET TIME ZONE / SET any GUC PG Mutates session GUC Pinning log Set on role/parameter group
Temp table (CREATE TEMP TABLE) PG/MySQL Session-scoped object Pinning log “temporary” CTE, or ON COMMIT DROP (still pins)
Session advisory lock (pg_advisory_lock) PG Outlives the transaction Pinning log “lock” pg_advisory_xact_lock (txn-scoped)
Server-side prepared statements PG/MySQL Named statement bound to session Pinning log “prepared” prepareThreshold=0 / simple query mode
LISTEN / NOTIFY PG Session subscription state Pinning log App-level pub/sub (SNS/SQS/EventBridge)
SET application_name PG Session variable Pinning log Pass via connection string once, accept pin
User-defined session variables (@var) MySQL Session-scoped state Pinning log Rework to per-statement values
GET_LOCK() MySQL Session-level named lock Pinning log Application-level locking
Multi-statement / unclosed transaction both Borrow can’t be returned safely BorrowLatency + Pinned rise Keep transactions short; commit promptly

You detect pinning two ways. The DatabaseConnectionsCurrentlySessionPinned CloudWatch metric shows how many connections are pinned right now; if it tracks close to your client connection count, multiplexing is effectively off. For root cause, enable proxy logging and read the pinning reason in the logs:

fields @timestamp, @message
| filter @message like /pinned/
| sort @timestamp desc
| limit 50

A pinning log line names the cause, for example a session variable being set, so you can map it back to a query pattern.

Practical fixes:

Driver-specific knobs that turn off the prepared-statement pin, with the cost of each:

Driver / ORM Setting Value to avoid pinning Cost of disabling
PostgreSQL JDBC prepareThreshold 0 No server-side plan cache
psycopg2 (Python) use simple cursor.execute avoid prepare=True Re-parse per statement
psycopg3 (Python) prepare_threshold None / high Loses prepared reuse
node-postgres (pg) named queries avoid name: on queries Re-parse per statement
Npgsql (.NET) Max Auto Prepare 0 No auto-prepared cache
SQLAlchemy (PG) use_native_hstore etc. avoid server-side SET in events Move config to role
Hibernate hibernate.jdbc.use_get_generated_keys etc. avoid session SETs Minor; move config to role
Sequelize / Prisma disable prepared statements flag per-adapter Re-parse per statement

How the two main engines differ where it matters for the proxy — the auth role name, the lock primitives, and the prepared-statement story are all engine-specific:

Aspect PostgreSQL MySQL
IAM-auth grant GRANT rds_iam TO user IDENTIFIED WITH AWSAuthenticationPlugin
Default port 5432 3306
Connection model Process per backend Thread per backend
Session-lock primitive that pins pg_advisory_lock GET_LOCK()
Non-pinning lock alternative pg_advisory_xact_lock App-level locking
search_path equivalent pin SET search_path USE db / session vars
Prepared-statement pin fix prepareThreshold=0 disable client prepared cache
engine_family value POSTGRESQL MYSQL
SET-pinning opt-out filter EXCLUDE_VARIABLE_SETS (varies)

Pinning is not a bug, it is correctness. The proxy refuses to leak one client’s session state into another’s. The job is to write query patterns that do not require session state to survive across transactions.

4. Tuning the pool: MaxConnectionsPercent, idle, and borrow timeout

Three knobs in connection_pool_config govern behavior under load.

max_connections_percent caps the proxy’s database connections as a percentage of the target’s max_connections. At 75% against an 850-slot instance, the proxy will open up to ~637 backends. Leave headroom: administrative tools, replication, and direct connections also consume slots. If you front one database with multiple proxies or also allow direct app connections, their percentages must sum to under 100 with margin.

max_idle_connections_percent sets how many backends the proxy keeps warm when demand drops. Higher means faster response to the next burst (no cold connect) at the cost of holding idle backends. For spiky serverless traffic, keeping this meaningfully above zero (e.g. 50%) avoids re-establishing connections on every burst; for steady traffic you can run it lower to free slots. It must be less than or equal to max_connections_percent.

connection_borrow_timeout is how long a client waits for a backend when the pool is saturated before the proxy returns an error. This is your backpressure valve. Under a connection storm with the pool maxed, clients queue here. A short timeout fails fast and sheds load (good for Lambda, where a hung invocation burns billed duration and concurrency); a longer timeout absorbs brief spikes without errors. Tune it against your client’s own timeout so the proxy errors before the client gives up, giving you a clean signal.

The full pool-config matrix — every knob, its range, default, and the trade-off:

Setting Range Default When to change Trade-off / limit
max_connections_percent 1–100 100 Lower to leave DB headroom Too low → borrow-timeout errors
max_idle_connections_percent 0–max_connections_percent 50 Higher for spiky, lower for steady High = idle backends held; low = cold connects
connection_borrow_timeout 0–3600 s 120 s Short for Lambda (fail fast) Too short fails under brief spikes
session_pinning_filters e.g. EXCLUDE_VARIABLE_SETS none Opt out of pinning on SET (PG) Only safe if state is harmless to share
init_query a SQL string none Run a safe SET on each new backend Adds latency per new backend

A starting point for a Lambda-heavy workload:

connection_pool_config {
  max_connections_percent      = 75
  max_idle_connections_percent = 50
  connection_borrow_timeout    = 30   # fail fast; let Lambda retry rather than hang
}

The headroom math, worked for an 850-slot instance, so multiple consumers never sum past the ceiling:

Consumer Allocation Backends (of 850) Running total Slots left
RDS Proxy (max_connections_percent 75) 75% ~637 637 213
Replication / rds_superuser reserve fixed ~10 647 203
Admin / migration tooling (direct) ad hoc ~20 667 183
A second proxy (e.g. analytics, 15%) 15% ~127 794 56
Safety margin 794 56

The signal to watch while tuning is DatabaseConnectionsBorrowLatency. If it climbs, clients are queuing for backends and you are either pinning too aggressively or max_connections_percent is too low for the offered load. How the knobs differ between a spiky serverless workload and a steady one:

Knob Spiky serverless (Lambda) Steady (always-on services) Why
max_idle_connections_percent High (50–60) Low (10–25) Spiky needs a warm pool ready for the next burst
connection_borrow_timeout Short (10–30 s) Longer (60–120 s) Lambda should fail fast and retry; services can wait
max_connections_percent Moderate (60–80) Higher (80–90) Serverless leaves room for re-warming bursts
idle_client_timeout Lower (300–900 s) Higher (1800+ s) Reap dead Lambda clients quickly

5. Failover acceleration and reader endpoints

The second reason to run RDS Proxy — independent of pooling — is failover behavior. During an Aurora or RDS Multi-AZ failover, the writer’s DNS flips to a new instance. Applications connecting directly must detect the broken connection, re-resolve DNS (subject to TTL and cached resolvers), and reconnect — and a fleet doing this simultaneously is a reconnect storm that can knock over the freshly promoted writer.

RDS Proxy changes the failure mode. It holds the client’s connection open, absorbs the reconnect against the database internally, and routes the held client connection to the new writer once promotion completes. The application sees a brief pause on in-flight transactions rather than a flood of connection errors. This is the single biggest lever for shrinking application-observed failover time, and it is why the related Aurora HA build puts the proxy in front of the writer as a baseline.

Direct-connect failover vs through-the-proxy failover, hop by hop:

Failover phase Direct connection Through RDS Proxy
Writer goes down Sockets break; app sees errors Proxy detects internally; holds client socket
DNS re-resolution App waits on TTL + resolver cache (seconds) Proxy re-resolves internally; client unaware
Reconnect Whole fleet reconnects at once (thundering herd) Proxy reconnects from the warm pool
Promoted writer load Hit by simultaneous reconnect storm Sees a bounded, paced reconnection
App-observed impact Burst of connection refused, 5xx Brief stall on in-flight transactions
Typical felt duration Tens of seconds (fleet-dependent) A few seconds, mostly the promotion itself

For Aurora clusters, create a read-only proxy endpoint so read traffic uses the proxy too and benefits from the same pooling and failover handling:

aws rds create-db-proxy-endpoint \
  --db-proxy-name prod-aurora-proxy \
  --db-proxy-endpoint-name prod-aurora-proxy-ro \
  --target-role READ_ONLY \
  --vpc-subnet-ids subnet-aaa subnet-bbb

Route writes at the default (read-write) endpoint and reads at the READ_ONLY endpoint. The endpoint types and what they target:

Endpoint TargetRole Routes to Use for Requirement
Default (created with proxy) READ_WRITE Current writer Writes + read-after-write Always present
Read-only endpoint READ_ONLY Readers (load-balanced) Scale-out reads ≥1 reader in the cluster
Additional RW endpoint READ_WRITE Writer Per-VPC or extra SG scoping Aurora cluster target

Two operational notes:

6. Enforcing TLS and IAM database authentication

Long-lived database passwords sitting in application config or environment variables are the credential you most want to eliminate. RDS Proxy lets you replace them with IAM authentication: the application calls AWS to mint a short-lived (15-minute) token and uses it as the database password. No static secret in the app; access is governed by IAM and fully logged.

With iam_auth = REQUIRED and require_tls = true on the proxy, the path is:

  1. The application’s IAM principal must hold rds-db:connect for the specific database user it logs in as:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "rds-db:connect",
    "Resource": "arn:aws:rds-db:us-east-1:111122223333:dbuser:prx-0abc123def456/app_user"
  }]
}

The resource ARN uses the proxy resource ID (prx-..., from describe-db-proxies), not the database instance ID, and the trailing segment is the database username. Scope it to exactly the user(s) the principal may assume.

  1. The application generates a token at connect time and uses it as the password:
export PGPASSWORD="$(aws rds generate-db-auth-token \
  --hostname prod-aurora-proxy.proxy-abc123.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username app_user \
  --region us-east-1)"

psql "host=prod-aurora-proxy.proxy-abc123.us-east-1.rds.amazonaws.com \
      port=5432 user=app_user dbname=appdb sslmode=verify-full \
      sslrootcert=/etc/ssl/certs/rds-combined-ca-bundle.pem"
  1. The database user must be allowed to authenticate via IAM. For PostgreSQL, grant the rds_iam role; once a role has rds_iam, it authenticates only via IAM tokens:
CREATE USER app_user;
GRANT rds_iam TO app_user;
GRANT CONNECT ON DATABASE appdb TO app_user;

Use sslmode=verify-full (not require) with the RDS CA bundle so the client verifies the proxy’s certificate and hostname, defeating man-in-the-middle. The token is generated with SigV4 — possession of it requires valid IAM credentials at that moment, and it expires in 15 minutes, so a leaked token is short-lived. The result: the app holds no database password, only an IAM identity.

The two auth layers side by side, so you never conflate them again:

Dimension Client → Proxy Proxy → Database
Mechanism IAM SigV4 token (or DB password) Secrets Manager credentials
Controlled by iam_auth (DISABLED/REQUIRED) secret_arn
Credential lifetime 15 minutes (token) Until rotated
IAM permission needed rds-db:connect on prx-*/user GetSecretValue + kms:Decrypt
DB-side requirement Role granted rds_iam (PG) User/password exist in the DB
What a leak exposes A 15-min token, IAM-scoped The backend secret (rotate it)

PostgreSQL sslmode values and why only verify-full is acceptable in production:

sslmode Encrypts? Verifies CA? Verifies hostname? MITM-safe? Verdict
disable No No No No Never
allow / prefer Maybe No No No Never for prod
require Yes No No No Encryption only; still spoofable
verify-ca Yes Yes No Partial Better, but hostname unchecked
verify-full Yes Yes Yes Yes The production setting

The client connection parameters that matter for a proxy + IAM-auth setup, and the value to use:

Parameter Value for RDS Proxy + IAM Why
host The proxy endpoint (not the DB endpoint) Connect to the pool, not the database
port 5432 (PG) / 3306 (MySQL) Engine default
user The DB user granted rds_iam Must match the rds-db:connect ARN
password Output of generate-db-auth-token Short-lived SigV4 token, not a static secret
sslmode verify-full Encrypt + verify cert and hostname
sslrootcert rds-combined-ca-bundle.pem The RDS CA chain to verify against
dbname Your database
connection scope (Lambda) init/module scope Reuse across warm invocations
reconnect trigger on OperationalError Survive failover gracefully

The IAM rds-db:connect resource ARN, decoded segment by segment (a frequent source of “access denied”):

ARN segment Example value Meaning Common mistake
service rds-db RDS IAM-auth service (not rds) Using rds: instead of rds-db:
region us-east-1 Region of the proxy Mismatched region
account 111122223333 Your account
resource type dbuser An IAM-auth DB user grant
resource id prx-0abc123def456 Proxy resource id, not DB id Using the cluster/instance id
db user app_user The exact DB username Wildcarding too broadly

7. Lambda integration, VPC networking, and cold-start storms

RDS Proxy and Lambda are made for each other, but the wiring has sharp edges.

VPC and security groups. The proxy lives in private subnets. The chain of security groups must allow: Lambda SG → proxy SG on 5432, and proxy SG → database SG on 5432. The proxy’s own SG is the trust boundary for the database; the database should accept connections from the proxy SG, not from the Lambda SG directly, so that the proxy is the only path in.

The exact security-group chain — three rules, no more:

# Rule on Direction Source / dest Port Purpose
1 Lambda SG Egress Proxy SG 5432/3306 Lambda can reach the proxy
2 Proxy SG Ingress Lambda SG 5432/3306 Proxy accepts Lambda clients
3 Proxy SG Egress DB SG 5432/3306 Proxy can reach the database
4 DB SG Ingress Proxy SG only 5432/3306 DB trusts only the proxy (not Lambda)

Get the connection lifecycle right. The classic Lambda mistake is opening a connection inside the handler and never reusing it. Open the connection (and fetch the IAM token) in the module/init scope so it is reused across warm invocations on the same execution environment:

import os, boto3, psycopg2

rds = boto3.client("rds")
HOST = os.environ["PROXY_HOST"]
USER = os.environ["DB_USER"]

def _connect():
    token = rds.generate_db_auth_token(
        DBHostname=HOST, Port=5432, DBUsername=USER, Region=os.environ["AWS_REGION"]
    )
    return psycopg2.connect(
        host=HOST, port=5432, user=USER, dbname=os.environ["DB_NAME"],
        password=token, sslmode="verify-full",
        sslrootcert="/var/task/rds-combined-ca-bundle.pem",
    )

conn = _connect()   # module scope: reused across warm invocations

def handler(event, context):
    global conn
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone()[0]
    except psycopg2.OperationalError:
        conn = _connect()   # reconnect on a dropped backend (e.g. after failover)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone()[0]

The Lambda connection-handling rules and what each prevents:

Practice Do Don’t What it prevents
Connection scope Open in init/module scope Open inside the handler each call Per-invocation connect storms
Token minting Mint in init; refresh on reconnect Mint on every query Wasted SigV4 calls; latency
Reconnect Catch OperationalError, reconnect Assume the socket survives failover Stuck connections post-failover
Idle reaping Set idle_client_timeout on proxy Leave dead clients forever Pool slots held by dead Lambdas
Reserved concurrency Cap to protect the backend Let unbounded concurrency fan out Borrow-timeout saturation
Transactions Keep short; commit promptly Hold a transaction across awaits Pinning / backend hogging

Why the proxy specifically helps cold starts. When Lambda scales out hard, hundreds of new execution environments each open a connection in init. Without the proxy that is a direct connection storm against the database. The proxy absorbs it: client connections land on the proxy and borrow from the warm pool, so the database sees a bounded number of backends no matter how wide Lambda fans out. Keep max_idle_connections_percent high enough that the warm pool is ready for the next burst rather than cold-connecting under the spike.

Token, not password, in the environment. Note there is no PGPASSWORD env var holding a secret — only the IAM token minted per connection. The Lambda execution role carries the rds-db:connect permission from section 6.

Architecture at a glance

Read the diagram left to right and you are reading a single client request as it actually travels. On the far left a serverless tier — a Lambda fleet and ECS/Fargate tasks — opens many short-lived TCP connections, each over TLS 1.2 and each carrying a freshly minted IAM token instead of a password (badge 1). Those connections do not land on the database; they land on the RDS Proxy in your private subnets on port 5432. Before the proxy lets a client in, it consults the control plane: IAM checks the principal holds rds-db:connect for that proxy resource id and user, and the proxy itself uses a Secrets Manager secret to authenticate its own backend connections to Aurora — two independent layers in one hop. Inside the proxy, the connection pool (badge 2) is where multiplexing lives or dies: it hands a warm backend to whichever client is mid-transaction and returns it on commit, unless a SET, temp table, or session lock forces it to pin that backend 1:1. When demand outruns the pool, clients queue at the proxy endpoint and then error at the borrow timeout (badge 3) — the backpressure valve.

From the proxy, pooled SQL flows to the Aurora cluster: writes to the writer (badge 4, where an AUTH_FAILURE on the backend secret surfaces as an UNAVAILABLE target), reads to the read-only endpoint and its readers. When the writer dies, the reader is promoted and the proxy holds the client socket and re-routes it (badge 5) — the application feels a stall, not a reconnect storm. Every hop emits to CloudWatch, and the two numbers that matter — DatabaseConnectionsBorrowLatency and DatabaseConnectionsCurrentlySessionPinned — are what tell you, before users do, whether the pool is healthy or has quietly degraded to passthrough. The whole method is in the legend: localise the symptom to a badge, read the confirm command, apply the fix.

RDS Proxy production architecture: a serverless tier of Lambda and ECS or Fargate opening many short-lived TLS connections with IAM tokens, authenticating through an IAM and Secrets Manager control plane into an RDS Proxy in private subnets on port 5432, where a pin-aware connection pool multiplexes clients onto a warm pool capped at 75 percent of max_connections and exposes a read-write and a read-only endpoint, forwarding pooled SQL to an Aurora Multi-AZ cluster with a writer and a promotable reader, with every hop emitting BorrowLatency and Pinned-connection metrics to CloudWatch, and numbered badges marking IAM-token rejection, connection pinning, borrow-timeout saturation, backend AUTH_FAILURE, and held-socket failover

Real-world scenario

A retail platform team — call them Karthik’s group at NimbusCart — ran order-processing on Lambda against a provisioned Aurora PostgreSQL cluster (db.r6g.2xlarge, ~3400 max_connections). They had already put RDS Proxy in front of the writer and considered the connection problem solved. During a Diwali flash sale, throughput tripled, and they watched DatabaseConnectionsBorrowLatency spike into the seconds while the database’s own connection count sat far below capacity — the pool was saturated even though the database was not. Clients were timing out on connection_borrow_timeout and Lambda was retrying, amplifying the load. Monthly spend on the cluster was about ₹95,000 and climbing because someone had already started a vertical-scale change ticket.

The constraint: they could not simply raise max_connections_percent, because the backends were not idle and available — they were pinned. DatabaseConnectionsCurrentlySessionPinned was tracking nearly 1:1 with client connections, so the proxy had silently degraded to passthrough and every Lambda invocation was effectively holding a dedicated backend for its whole lifetime. The proxy they were paying for was doing nothing the database couldn’t have done by being directly connected.

Enabling debug logging surfaced the cause in the pinning logs: the ORM issued SET search_path on every connection, and an audit path used a session-level advisory lock. Both pin. The fix was three moves, none of them “buy a bigger database”:

-- 1. Move search_path off the runtime SET and onto the role so the proxy never sees it
ALTER ROLE app_user SET search_path = orders, public;
# 2. Restore headroom now that connections multiplex again
connection_pool_config {
  max_connections_percent      = 80
  max_idle_connections_percent = 60
  connection_borrow_timeout    = 20   # fail fast, let Lambda retry, shed load cleanly
}

They also replaced the session advisory lock with a transaction-scoped pg_advisory_xact_lock, which releases at commit and does not pin. After the change, DatabaseConnectionsCurrentlySessionPinned dropped to near zero, borrow latency fell back into single-digit milliseconds, and the same cluster absorbed the next sale at double the concurrency without touching the instance class. They closed the vertical-scale ticket and the bill stayed flat. The lesson the team took away: with RDS Proxy, the metric that predicts an outage is pinning, not CPU.

The incident as a timeline, because the order of moves is the lesson:

Time Symptom Action taken Effect What it should have been
T+0 BorrowLatency spiking, DB conns low (alert fires) Ask: are backends pinned?
T+5m Lambda timeouts climbing Considered raising max_connections_percent Would not help — backends pinned Check ...SessionPinned first
T+12m Vertical-scale ticket opened Proposed db.r6g.4xlarge Cost spike, wouldn’t fix pinning Don’t scale to mask
T+20m Root cause found Enabled debug_logging, read pinning reasons SET search_path + session lock named The breakthrough
T+30m Mitigated ALTER ROLE ... SET search_path; restore pool headroom Pinned → near zero Correct fix
T+45m Verified Replace lock with pg_advisory_xact_lock BorrowLatency back to single-digit ms The actual fix is query patterns
+1 sale Fixed Same cluster, 2× concurrency Zero pinning, flat bill Closed the scale ticket

Advantages and disadvantages

RDS Proxy is the right tool for serverless and high-fan-out workloads, but it is not free and not transparent. Weigh it honestly:

Advantages Disadvantages
Absorbs connection storms — bounded backends regardless of client fan-out You pay per vCPU-hour of the fronted instance class, always-on
Accelerates failover by holding client sockets (no reconnect herd) Adds a network hop and a small latency tax per borrow
IAM auth deletes long-lived DB passwords; tokens expire in 15 min Misconfigured rds-db:connect/rds_iam causes opaque auth failures
Managed, autoscaling, VPC-native — no PgBouncer fleet to operate Less tunable than self-managed PgBouncer (fewer pool modes)
Integrated with Secrets Manager rotation Pinning silently degrades it to passthrough — easy to miss
Multiplexing lets a tiny pool serve thousands of clients Only helps short transactions; long/reporting queries pin or hog
Read-only endpoints extend pooling + failover to reads Reader endpoints still serve stale reads (replica lag unchanged)

The model is right when you have spiky or serverless traffic, want failover felt as a stall, and want to retire database passwords. It is a poor fit — or a waste of money — for steady, low-concurrency workloads with a single always-on app server (a direct connection or an in-process pool is cheaper), for batch/reporting workloads dominated by long transactions (which pin), and for cases where the team won’t audit and eliminate pinning (you pay for a pooler that passes through). When to reach for RDS Proxy versus the alternatives:

If your workload is… Best choice Why
Lambda / serverless, spiky RDS Proxy Absorbs storms; IAM auth; managed
Fargate/ECS microservices, many tasks RDS Proxy Shared pool across tasks
One always-on app server, steady In-process pool (HikariCP/pgbouncer-local) Cheaper; no extra hop
Need exotic pool modes / per-tenant pools Self-managed PgBouncer More control than the proxy offers
Batch / ETL with long transactions Direct connection Pooling adds nothing; would pin
Want failover felt as a stall, any tier RDS Proxy Held sockets beat reconnect herds

Hands-on lab

Provision a proxy in front of an Aurora PostgreSQL cluster, connect through it with an IAM token over TLS, prove multiplexing, and tear down. This uses small instance classes but is not free-tier — Aurora and the proxy bill per hour; budget a few rupees and delete at the end. Run in CloudShell (which already has credentials and the RDS CA bundle reachable).

Step 1 — Variables.

export AWS_REGION=us-east-1
RG_PREFIX=rdsproxy-lab
VPC=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[0].VpcId" --output text)
SUBNETS=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC --query "Subnets[0:2].SubnetId" --output text)
echo "VPC=$VPC SUBNETS=$SUBNETS"

Step 2 — Create the backend secret. (Use a real cluster you already have, or create a tiny Aurora cluster first; here we assume prod-aurora exists.)

cat > /tmp/secret.json <<'JSON'
{ "username": "app_proxy_user", "password": "ChangeMe-Strong-1", "engine": "postgres" }
JSON
SECRET_ARN=$(aws secretsmanager create-secret --name $RG_PREFIX/proxy-user \
  --secret-string file:///tmp/secret.json --query ARN --output text)
echo "SECRET_ARN=$SECRET_ARN"

Expected: an ARN ending in random suffix characters.

Step 3 — Create the proxy IAM role (trust rds.amazonaws.com, allow read of the secret). Create the trust + permission policy, then the role, then attach. Confirm with:

aws iam get-role --role-name $RG_PREFIX-role --query "Role.Arn" --output text

Step 4 — Create the proxy and target group.

PROXY_ARN=$(aws rds create-db-proxy \
  --db-proxy-name $RG_PREFIX-proxy --engine-family POSTGRESQL \
  --role-arn $(aws iam get-role --role-name $RG_PREFIX-role --query Role.Arn --output text) \
  --vpc-subnet-ids $SUBNETS --require-tls \
  --auth "[{\"AuthScheme\":\"SECRETS\",\"SecretArn\":\"$SECRET_ARN\",\"IAMAuth\":\"REQUIRED\"}]" \
  --query "DBProxy.DBProxyArn" --output text)
echo "PROXY_ARN=$PROXY_ARN"

Then register the cluster as a target:

aws rds register-db-proxy-targets --db-proxy-name $RG_PREFIX-proxy \
  --db-cluster-identifier prod-aurora

Step 5 — Wait for the target to go AVAILABLE.

aws rds describe-db-proxy-targets --db-proxy-name $RG_PREFIX-proxy \
  --query "Targets[].{Target:RdsResourceId,State:TargetHealth.State,Reason:TargetHealth.Reason}" --output table

Expected: State = AVAILABLE. If AUTH_FAILURE, the secret password is wrong or the role can’t read it — fix and re-check.

Step 6 — Grant the DB user IAM auth (run once, connected as an admin):

CREATE USER app_user;
GRANT rds_iam TO app_user;
GRANT CONNECT ON DATABASE appdb TO app_user;

Step 7 — Connect through the proxy with an IAM token over TLS.

PROXY_HOST=$(aws rds describe-db-proxies --db-proxy-name $RG_PREFIX-proxy \
  --query "DBProxies[0].Endpoint" --output text)
export PGPASSWORD="$(aws rds generate-db-auth-token --hostname $PROXY_HOST --port 5432 \
  --username app_user --region $AWS_REGION)"
psql "host=$PROXY_HOST port=5432 user=app_user dbname=appdb sslmode=verify-full \
      sslrootcert=/etc/ssl/certs/rds-combined-ca-bundle.pem" -c "select current_user, now();"

Expected: a row with app_user and a timestamp — proof the IAM-token + TLS path works.

Step 8 — Prove multiplexing. Open a quick load loop in another shell (a handful of concurrent psql sessions running SELECT pg_sleep(0.05) in a loop), then read the proxy metrics in CloudWatch and compare ClientConnections against DatabaseConnectionsCurrentlyBorrowed — the former should exceed the latter.

The lab steps mapped to what each proves:

Step What you did What it proves Real-world analogue
4 Create proxy with IAMAuth=REQUIRED Two-layer auth is configured The production provisioning
5 Wait for AVAILABLE The proxy can authenticate to the DB The first health gate
6 GRANT rds_iam The DB user is IAM-only Deleting the password
7 Connect with a token + verify-full The passwordless TLS path works Every app connection
8 Compare client vs borrowed Multiplexing is real The whole point of the proxy

Cleanup (avoid lingering hourly charges).

aws rds delete-db-proxy --db-proxy-name $RG_PREFIX-proxy
aws secretsmanager delete-secret --secret-id $RG_PREFIX/proxy-user --force-delete-without-recovery
aws iam delete-role --role-name $RG_PREFIX-role   # detach/delete inline policies first

Cost note. The proxy bills per vCPU-hour of the fronted instance class plus a small per-hour charge; an hour of this lab is a few rupees. Aurora itself is the bigger line item — delete any cluster you spun up just for the lab.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. Start with the fast decision table to land on the right row, then the full playbook, then the entries that bite hardest with confirm-command detail.

If you see… It’s probably… Do this first
BorrowLatency high, DB conns low Pinning Check ...SessionPinned; kill the pin
Target UNAVAILABLE Backend auth/connectivity describe-db-proxy-targets → read the reason
PAM authentication failed IAM not wired end to end Check rds-db:connect ARN + rds_iam grant
Cert / TLS error on connect sslmode or CA bundle wrong Use verify-full + RDS CA bundle
remaining connection slots reserved Something bypasses the proxy Lock DB SG to proxy SG only
Reconnect flood at failover Direct connect / no reconnect Route via proxy; reconnect on error
RO endpoint has no targets No reader in the cluster Add a reader, or drop the endpoint
Stale reads via RO endpoint Replica lag (not the proxy) Read-after-write at the RW endpoint

The full playbook table you read mid-incident:

# Symptom Root cause Confirm (exact cmd / console path) Fix
1 BorrowLatency spikes, DB connection count is low Pinning — backends dedicated 1:1, pool starved CloudWatch DatabaseConnectionsCurrentlySessionPinnedClientConnections; proxy logs pinned Eliminate SET/temp/locks; ALTER ROLE for search_path
2 Target UNAVAILABLE, reason AUTH_FAILURE Wrong secret creds, or role can’t read the secret aws rds describe-db-proxy-targets --query "Targets[].TargetHealth" Fix secret value; grant GetSecretValue + kms:Decrypt
3 Clients error timed out waiting for a connection Pool saturated; connection_borrow_timeout hit DatabaseConnectionsBorrowLatency rising; pool maxed Raise max_connections_percent (within headroom); cut pinning
4 IAM login fails PAM authentication failed Principal lacks rds-db:connect, or user not rds_iam aws iam simulate-principal-policy for rds-db:connect; check \du for rds_iam Add rds-db:connect on prx-*/user; GRANT rds_iam
5 TLS connection refused / cert error sslmode wrong or missing RDS CA bundle psql ... sslmode=verify-full sslrootcert=... errors Use verify-full + rds-combined-ca-bundle.pem
6 DB still hits remaining connection slots reserved App bypasses the proxy and connects direct Audit SGs; DB ingress allows Lambda SG directly DB SG ingress = proxy SG only; route all via proxy
7 Read-only endpoint has no targets No reader in the Aurora cluster Target reason INVALID_REPLICATION_STATE Add ≥1 reader; or drop the RO endpoint
8 Failover still causes a reconnect flood App connects direct, or doesn’t reconnect App logs show connection refused burst at failover Route through proxy; catch OperationalError, reconnect
9 Lambda opens a connection per invocation Connection created in handler, not init scope Code review; ClientConnections spikes with invokes Move connect to module/init scope; reuse
10 Secret rotation breaks the proxy Rotated secret out of sync, or extra-grace not set Targets flip to AUTH_FAILURE right after rotation Use Secrets Manager rotation that keeps both valid during window
11 Multiplexing ratio ~1:1 despite no obvious SET Server-side prepared statements pinning Proxy logs pinned reason “prepared statement” prepareThreshold=0 / simple query mode
12 kms:Decrypt denied on the secret KMS key policy / ViaService condition wrong CloudTrail Decrypt AccessDenied for the proxy role Allow kms:Decrypt with kms:ViaService for Secrets Manager
13 Intermittent connection drops under steady load idle_client_timeout too low reaping live clients Drops align with the idle timeout value Raise idle_client_timeout to match keep-alive
14 Reads return stale data through RO endpoint Replica lag, not a proxy problem aws rds describe-db-clusters replica lag metric Read-after-write at RW endpoint; accept lag for RO

The expanded form, with the full reasoning for the entries that cause the most wasted hours:

1. BorrowLatency spikes while the database’s own connection count is low. Root cause: Connection pinning — backends are dedicated 1:1 to clients, so the pool can’t multiplex and clients queue for a free backend that never comes. Confirm: DatabaseConnectionsCurrentlySessionPinned tracking near ClientConnections in CloudWatch; enable debug_logging and grep the proxy log for pinned to get the exact reason. Fix: Move SET search_path to ALTER ROLE; replace session advisory locks with pg_advisory_xact_lock; disable server-side prepared statements; avoid session temp tables in hot paths.

2. The proxy target sits UNAVAILABLE with reason AUTH_FAILURE. Root cause: The proxy cannot log into the database with the Secrets Manager credentials — wrong password in the secret, wrong username, or the proxy role can’t read/decrypt the secret. Confirm: aws rds describe-db-proxy-targets --db-proxy-name <p> --query "Targets[].{State:TargetHealth.State,Reason:TargetHealth.Reason}". Fix: Correct the secret value; ensure the role allows secretsmanager:GetSecretValue and kms:Decrypt (with the kms:ViaService condition); verify the DB user/password actually exist.

3. Clients error “timed out waiting for a connection.” Root cause: The pool is saturated and clients waited past connection_borrow_timeout. Either offered load exceeds max_connections_percent, or pinning has shrunk the effective pool. Confirm: DatabaseConnectionsBorrowLatency climbing toward the timeout; pinned-connections high. Fix: First eliminate pinning (it’s usually the real cause); then raise max_connections_percent within the headroom budget; for Lambda, keep the timeout short so it fails fast and retries rather than hanging.

4. IAM login fails with PAM authentication failed for user. Root cause: The calling principal lacks rds-db:connect for the proxy resource id + user, or the database user wasn’t granted rds_iam. Confirm: aws iam simulate-principal-policy --policy-source-arn <principal> --action-names rds-db:connect --resource-arns arn:aws:rds-db:...:dbuser:prx-.../app_user; in psql, \du shows whether the role has rds_iam. Fix: Add rds-db:connect scoped to prx-*/app_user; GRANT rds_iam TO app_user; confirm the ARN uses the proxy resource id, not the cluster id.

6. The database still hits remaining connection slots are reserved. Root cause: Something is bypassing the proxy and connecting directly — a forgotten cron box, an admin tool, or the app’s old direct connection string still in config. Confirm: Audit the DB security group; if its ingress allows the Lambda/app SG directly (not just the proxy SG), direct connections are possible. Check pg_stat_activity for client addresses that aren’t the proxy. Fix: Set the DB SG ingress to allow the proxy SG only; remove direct connection strings; route all traffic through the proxy so the pool is the single chokepoint.

10. Secret rotation suddenly flips targets to AUTH_FAILURE. Root cause: The rotation changed the password in Secrets Manager but the proxy/DB briefly disagree, or the rotation strategy doesn’t keep the old credential valid during the cutover window. Confirm: Targets go AUTH_FAILURE exactly at the rotation timestamp; CloudTrail shows the rotation Lambda invocation. Fix: Use a rotation strategy that keeps both the old and new credential valid during the window (alternating-users or a grace period), and ensure the proxy role can read the rotated version.

Best practices

The alarms worth wiring before the next incident — leading indicators, not “database down”:

Alert on Metric Threshold (starting point) Why it’s leading
Pinning DatabaseConnectionsCurrentlySessionPinned > 20% of ClientConnections for 5 min Multiplexing degrading to passthrough
Borrow latency DatabaseConnectionsBorrowLatency p95 > 50 ms sustained Clients queuing for backends
Pool saturation DatabaseConnectionsCurrentlyBorrowed > 90% of cap About to hit borrow-timeout errors
Client growth ClientConnections Sudden 2× jump Fan-out / storm forming
Backend health Target State not AVAILABLE for 2 min Auth or connectivity failure
Query latency QueryDatabaseResponseLatency > your SLO Backend slow vs proxy slow disambiguation

The pre-production ship gate — verify each before routing real traffic, with the exact check:

Gate Verify with Pass condition
Target healthy describe-db-proxy-targets State = AVAILABLE
IAM auth works psql with generate-db-auth-token + verify-full Returns current_user
Backend secret + KMS proxy role policy review GetSecretValue + kms:Decrypt present
Multiplexing real CloudWatch under load ClientConnections...Borrowed
Pinning near zero CloudWatch under load ...SessionPinned ≈ 0
Failover behaves failover-db-cluster drill Brief stall, no refusal flood
DB SG locked down SG ingress audit Allows proxy SG only
No direct password config / env / image scan No PGPASSWORD static value

Security notes

The security controls that also improve resilience — they pull in the same direction:

Control Mechanism Secures against Also prevents
IAM auth required iam_auth=REQUIRED + rds_iam Leaked/stolen DB passwords Stale-password outages
Least-privilege rds-db:connect Scoped resource ARN Lateral movement to other DB users Accidental cross-user access
Scoped secret + KMS read GetSecretValue + kms:ViaService Secret exfiltration Over-broad role blast radius
require_tls + verify-full Proxy TLS + client verify MITM / downgrade “Temporary” plaintext mistakes
DB SG = proxy SG only Security-group chain Direct DB exposure Bypass storms hitting max_connections
Safe secret rotation Alternating/grace rotation Stale credential window AUTH_FAILURE flapping at rotation

Cost & sizing

The bill drivers and how they interact with the design:

A rough monthly picture for a small-to-mid serverless API on Aurora PostgreSQL: the cluster (db.r6g.large writer + one reader) dominates at roughly ₹70,000–90,000/month; the RDS Proxy in front adds on the order of ₹6,000–10,000/month depending on the fronted class and hours; CloudWatch is a few hundred rupees. The proxy line is small next to the cluster and trivial next to the cost of a connection-exhaustion outage during a sale. The cost levers and what each buys:

Cost driver What you pay for Rough INR / month What it fixes Watch-out
RDS Proxy Per vCPU-hour of fronted class + per-hour ~₹6,000–10,000 Pooling, failover, IAM auth Always-on; needs spiky load to justify
Aurora writer (db.r6g.large) One writer instance-hour ~₹40,000–55,000 The database itself max_connections scales with size
Aurora reader (1×) One reader instance-hour ~₹30,000–40,000 Read scale + RO endpoint target Each reader is a full instance
Vertical scale to raise max_connections Bigger instance class +50–100% of cluster (avoid — pooling is cheaper) OOM risk; masks pinning
CloudWatch metrics Standard metric/alarm rate ~₹300–800 Visibility (pinning, latency) debug_logging spikes log ingestion
NAT/VPC endpoints (if used) Per-hour + per-GB ~₹1,500–3,000 Private path for token mint/secrets Only if Lambda has no internet route

Interview & exam questions

1. What problem does RDS Proxy solve that a bigger database instance does not? A bigger instance raises max_connections, but that scales with memory and is expensive and bounded; it doesn’t change the fact that serverless fan-out demands one backend per client. RDS Proxy multiplexes many short-lived client connections onto a small warm pool, so a smaller instance serves far more clients. It also accelerates failover and enables IAM auth — neither of which a bigger instance provides.

2. Explain transaction-level multiplexing and when it stops working. A backend connection is borrowed from the pool when a client begins a transaction and returned when it commits/rolls back, so between transactions the client holds no backend and one backend serves many clients. It stops working when the client mutates session state (e.g. SET search_path, temp tables, session locks, server-side prepared statements), which forces the proxy to pin the backend 1:1.

3. What is connection pinning and how do you detect it? Pinning is the proxy dedicating a backend to one client for the life of its session because the connection carries session state unsafe to share. Detect it via the DatabaseConnectionsCurrentlySessionPinned CloudWatch metric (high relative to ClientConnections) and by enabling debug_logging and reading the pinned reason in the proxy logs.

4. Distinguish the two authentication layers in an IAM-auth RDS Proxy setup. Clients authenticate to the proxy using a 15-minute SigV4 IAM token (controlled by iam_auth = REQUIRED and the principal’s rds-db:connect permission). The proxy authenticates to the database using credentials from Secrets Manager (secret_arn). They are independent — the proxy always uses the secret for backend connections regardless of how clients authenticate.

5. Why does RDS Proxy make failover faster from the application’s point of view? It holds the client’s connection open during failover, absorbs the reconnect internally, and re-routes the held connection to the newly promoted writer once DNS flips. The application sees a brief stall on in-flight transactions instead of a fleet-wide reconnect storm of connection refused errors.

6. How do you size max_connections_percent, and what breaks if you get it wrong? Set it so that all consumers’ percentages (the proxy, other proxies, admin tools, replication, direct connections) sum to under 100 of the target’s max_connections with margin. Too low and clients hit connection_borrow_timeout; too high (or summing past 100) and the database hits remaining connection slots are reserved.

7. A target is UNAVAILABLE with reason AUTH_FAILURE. What’s wrong and how do you fix it? The proxy can’t authenticate to the database with the Secrets Manager credentials — wrong password/username in the secret, or the proxy role can’t GetSecretValue/kms:Decrypt. Fix the secret value and grant the role read+decrypt (with the kms:ViaService condition). It is a proxy→database problem, not a client→proxy one.

8. How do you delete the long-lived database password from a Lambda function? Set iam_auth = REQUIRED on the proxy, grant the database user rds_iam, and give the Lambda execution role rds-db:connect scoped to the proxy resource id and that user. The function mints a SigV4 token at connect time and uses it as the password; no static password is stored. Use sslmode=verify-full with the RDS CA bundle.

9. Why must clients use sslmode=verify-full rather than require? require encrypts the channel but does not verify the server’s certificate or hostname, so a man-in-the-middle can present its own cert and intercept traffic. verify-full validates the proxy’s certificate against the RDS CA bundle and checks the hostname, defeating MITM. Encryption without verification is not authentication.

10. Your Lambda opens a new connection on every invocation. What’s the impact and the fix? Each cold-start environment opening (and worse, each invocation opening) a connection creates a connection storm and wastes SigV4 token mints and latency. Open the connection (and mint the token) in module/init scope so it’s reused across warm invocations, and reconnect on OperationalError to survive failover.

11. When is RDS Proxy the wrong choice? For a steady, low-concurrency workload with a single always-on app server, an in-process pool is cheaper and adds no hop. For batch/reporting workloads dominated by long transactions, pooling adds nothing (those pin). And if the team won’t audit and eliminate pinning, you pay per vCPU-hour for a pooler that has degraded to passthrough.

12. How does a read-only proxy endpoint behave, and what does it require? It routes read traffic to the cluster’s readers (load-balanced) with the same pooling and failover handling, created with --target-role READ_ONLY. It requires at least one reader in the Aurora cluster (otherwise the target is INVALID_REPLICATION_STATE), and it does not change replication semantics — reads can still be stale by the replica lag.

These map primarily to AWS Certified Database – Specialty (connection management, RDS/Aurora HA, security) and Solutions Architect – Professional (resilient, secure data tiers for serverless). The IAM-auth and least-privilege angle touches Security – Specialty. A compact cert-mapping for revision:

Question theme Primary cert Objective area
Pooling, multiplexing, pinning Database – Specialty Connection management & performance
Failover acceleration, RO endpoints Database – Specialty / SAP HA & resilience for data stores
IAM auth, rds-db:connect, rds_iam Security – Specialty Identity & access for data services
Secrets Manager + KMS for the proxy Security – Specialty Data protection & secrets
Lambda/VPC wiring, SG chain SAP / Database – Specialty Serverless data access patterns
Sizing max_connections_percent Database – Specialty Capacity & quotas

Quick check

  1. Your DatabaseConnectionsBorrowLatency is spiking but the database’s own connection count is well below max_connections. What’s the most likely cause, and which single metric confirms it?
  2. True or false: scaling the database vertically to raise max_connections is the correct fix when a proxy’s clients are timing out on borrow.
  3. In an IAM-auth setup, which credential does the proxy use to connect to the database, and which does the client use to connect to the proxy?
  4. Your proxy target is UNAVAILABLE with reason AUTH_FAILURE. Name two things to check.
  5. Why is sslmode=require insufficient and what should you use instead?

Answers

  1. Connection pinning — backends are dedicated 1:1, so the pool can’t multiplex and clients queue for a free backend. Confirm with DatabaseConnectionsCurrentlySessionPinned; if it tracks near ClientConnections, multiplexing is effectively off.
  2. False. If clients are timing out because of pinning, a bigger instance doesn’t help — the backends aren’t idle, they’re pinned. Eliminate pinning first (move SET to the role, use transaction-scoped locks, disable server-side prepared statements); only then consider max_connections_percent headroom.
  3. The proxy authenticates to the database with credentials from Secrets Manager (secret_arn). The client authenticates to the proxy with a short-lived IAM SigV4 token (iam_auth = REQUIRED, principal holds rds-db:connect). Independent layers.
  4. Check (a) the secret value is correct (right username/password for a user that exists in the DB) and (b) the proxy role can secretsmanager:GetSecretValue and kms:Decrypt the secret (with the kms:ViaService condition). Verify with describe-db-proxy-targets.
  5. require encrypts but does not verify the server certificate or hostname, so it’s vulnerable to MITM. Use verify-full with the RDS CA bundle so the client validates the proxy’s cert and hostname.

Glossary

Next steps

You can now put a managed pooler in the path, keep it multiplexing, make failover a stall, and delete the database password. Build outward:

awsrdsaurorards-proxyconnection-poolinglambda
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments