Secrets Manager Rotation at Scale: Custom Rotation Lambdas, RDS Credentials, and Cross-Account Sharing

Storing a secret in AWS Secrets Manager is the easy part. The thing that actually reduces blast radius is rotation — and rotation is where teams quietly fail, because a botched cycle either locks an application out of its own database or, worse, succeeds silently while applications keep using a credential that no longer works. The mechanics are precise: one staging-label state machine you must respect, a versioning model you cannot shortcut, and networking/KMS permissions that are unforgiving when wrong. Get one of those three wrong and the failure mode is the same flavour every time — a Lambda that times out with no useful error, a secret stuck in AWSPENDING, and a 02:00 pager.

This guide walks the four-step rotation model end to end, contrasts single-user and alternating-user RDS strategies, shows how to write a custom rotator for non-RDS credentials, and covers the cross-account sharing platform teams always end up needing. It is built as a reference you keep open mid-incident: read the prose once, then keep the tables — the step contract, the IAM grants, the VPC/KMS requirements, the failure playbook — open at 02:00. Every operation gets both the aws CLI and the IaC (Terraform) form, because the console wires permissions you will forget to declare in code.

By the end you will stop guessing. When rotation fails you will know whether it is a non-idempotent step, a missing Secrets Manager VPC endpoint, a master secret that cannot clone the user, an aws/secretsmanager-encrypted secret that can never go cross-account, or a consumer whose identity policy was never granted even though the resource policy was. Knowing which in ninety seconds is what separates a five-minute incident from a two-hour one.

What problem this solves

A long-lived database password is a standing liability. It leaks through a log line, a .env committed by accident, a laptop image, an offboarded contractor who still remembers it. The only durable mitigation is to make the credential short-lived — rotate it on a schedule so that any copy an attacker holds expires on its own. Secrets Manager automates that, but the automation has sharp edges, and the edges are exactly where production breaks.

What breaks without getting this right: an engineer enables single-user rotation on a hot-path credential and every cycle produces a spike of authentication errors as forty services slowly notice the password changed; a rotation Lambda is dropped into the database VPC but never given a Secrets Manager endpoint, so it times out reaching the API and the secret sticks half-rotated in AWSPENDING; a central platform team shares a secret cross-account using the AWS-managed key and the consumer account gets AccessDenied on kms:Decrypt forever, because that key’s policy cannot be edited; a finishSecret step is non-idempotent, Secrets Manager retries it, and the staging labels desync. Each of these is perfectly diagnosable and each costs a team an afternoon the first time.

Who hits this: anyone running RDS/Aurora, DocumentDB, Redshift, RDS Proxy, or any credentialed third-party service at scale. It bites hardest on multi-account organisations (cross-account KMS is non-obvious), VPC-isolated databases (the networking is the silent killer), high-traffic hot paths (single-user rotation windows show up as real error budget), and custom integrations where AWS publishes no managed rotation function and you must write all four steps yourself.

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the one place to look first:

Failure class	What you observe	First question to ask	First place to look	Most common single cause
Rotation Lambda timeout	`RotationFailed`, stuck `AWSPENDING`	Can the Lambda reach BOTH the DB and the SM API?	CloudWatch Logs for the rotator	No Secrets Manager VPC endpoint in private subnets
Rotation-window auth errors	5xx spike every cycle	Single-user or alternating-user?	`VersionIdsToStages` + app logs	Single-user on a hot path
`setSecret` permission denied	Logs stop at step 2	Can the function clone / `ALTER USER`?	Rotator logs, DB grants	Missing `masterarn` (alternating-user)
Cross-account `AccessDenied`	Consumer can’t `GetSecretValue`	Resource policy AND KMS AND identity policy all allow?	`simulate-principal-policy`	`aws/secretsmanager` key (uneditable) cross-account
Stale / never-rotated secret	Believed covered, never rotated	Is rotation even enabled and succeeding?	AWS Config rules, `LastRotatedDate`	Rotation disabled or silently failing
App uses dead credential	Auth fails after a clean rotation	Is the app pinned to a `VersionId`?	App config / cache code	Pinned version instead of `AWSCURRENT`

Learning objectives

By the end of this article you can:

Explain the staging-label state machine (AWSCURRENT / AWSPENDING / AWSPREVIOUS) and why rotation is a relabel, never an in-place edit.
Implement the four-step rotation Lambda (createSecret → setSecret → testSecret → finishSecret) with each step correctly idempotent against Secrets Manager’s retries.
Choose between single-user and alternating-user RDS rotation and explain exactly why alternating-user eliminates the invalid-credential window.
Stand up the VPC, security-group, and Secrets Manager endpoint wiring a rotation Lambda needs to reach a private database without timing out.
Author the IAM execution role, Lambda resource policy, and KMS grants rotation requires — and avoid the aws/secretsmanager-key cross-account trap.
Write a custom rotator for a third-party credential where the provider issues the secret instead of accepting one, collapsing createSecret/setSecret safely.
Share a secret cross-account so both the resource policy and the consumer’s identity policy allow, and rotation keeps flowing to consumers automatically.
Integrate consumers with the caching client so they follow AWSCURRENT without rotation-window failures, and monitor rotation health and staleness with EventBridge and AWS Config.

Prerequisites & where this fits

You should already understand the Secrets Manager basics — that a secret is a named, versioned, KMS-encrypted blob you read with GetSecretValue, billed per secret per month plus per 10,000 API calls. You should be comfortable with IAM identity-based vs resource-based policies, KMS key policies, and the difference between an AWS-managed key (aws/secretsmanager) and a customer-managed key (CMK). Basic VPC literacy (private subnets, security groups, interface endpoints) and the ability to run aws CLI v2 and read JSON output are assumed.

This sits at the intersection of the Security and Databases tracks. It builds directly on the AWS Secrets Manager & Parameter Store Deep Dive (storage, naming, versioning fundamentals) and the AWS KMS Encryption Deep Dive (key policies, grants, envelope encryption — the cross-account half of this article is a KMS problem). On the database side it pairs with the AWS RDS & Aurora Deep Dive and especially RDS Proxy: Connection Pooling, Failover, IAM Auth, because RDS Proxy is what turns rotation cutover into a non-event. The cross-account sharing pattern leans on IAM Cross-Account Roles, External ID, Confused Deputy and Multi-Region KMS Keys & Envelope Encryption. The rotation engine itself is a Lambda running inside the database VPC.

A quick map of who owns what during a rotation incident, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Secret + staging labels	The versioned value, `AWSCURRENT`/`AWSPENDING`	App / platform	Stuck pending, desynced labels
Rotation Lambda	The four-step state machine	App / platform	Non-idempotent step, timeout
VPC / SG / endpoints	Network path to DB and SM API	Network team	Lambda timeout (no route)
IAM role + resource policy	Who may rotate / invoke	Security / platform	Permission denied mid-cycle
KMS key + key policy	Encryption + cross-account decrypt	Security / KMS admin	Cross-account `AccessDenied`
RDS / Aurora	The backing user(s), master secret	DBA / platform	`ALTER USER` fails, no clone
Consumer app	Cache, refresh, version pinning	Consuming team	Uses dead credential after rotation

Core concepts

Five mental models make every later diagnosis obvious.

Rotation is a relabel, never an in-place edit. A secret holds one or more versions, each carrying one or more staging labels. Rotation creates a new version, parks it under AWSPENDING, proves it works, then atomically moves AWSCURRENT onto it. Applications reading AWSCURRENT (the default) never see the pending value until cutover completes — that is the entire reason rotation can be zero-downtime. If you ever find yourself “updating” a secret value during rotation, you are off the rails.

Secrets Manager drives your Lambda four times. It invokes your function once per step (createSecret, setSecret, testSecret, finishSecret), passing the step name and the new version’s ID (ClientRequestToken). Your function is a dispatcher on that field. Secrets Manager retries on failure, so every step must be idempotent — running twice must be safe.

RDS rotation has two strategies, and the choice is operational, not cosmetic. Single-user changes the password on the same user every cycle, so there is a brief window in which the old credential is invalid. Alternating-user keeps two users and swaps between them: each cycle resets the password of the idle user and promotes it, while the user that applications are still holding keeps working until the following cycle, so there is no invalid window. The alternating strategy needs a second, elevated master secret to clone the user.

The rotation Lambda must reach two endpoints, and a VPC Lambda loses easy egress. It must reach the database (security-group access on the DB port) and the Secrets Manager API. A Lambda attached to a VPC has no default internet egress, so it needs either a Secrets Manager interface VPC endpoint or a NAT path to call the API. Forgetting this is the single most common rotation failure, and it presents as a bare timeout.

Cross-account sharing requires three independent allows. The owning account’s secret resource policy must grant the consumer; the KMS key must permit the consumer to Decrypt (and it must be a CMK, because the aws/secretsmanager managed key cannot be shared); and the consumer’s own identity-based policy must allow secretsmanager:GetSecretValue and kms:Decrypt. All three must hold because each account authorises independently, and a gap in any one of them surfaces as the same AccessDenied.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to rotation
Version	An immutable copy of the secret value	The secret	Rotation writes a new one each cycle
Staging label	A movable pointer to one version	On a version	`AWSCURRENT`/`AWSPENDING` drive cutover
`AWSCURRENT`	The version apps get by default	One version	What consumers should always read
`AWSPENDING`	The in-flight version being tested	One version (transient)	Lingering = a step failed mid-cycle
`AWSPREVIOUS`	The version that was current	One version	One-generation rollback
Rotation Lambda	The four-step state machine	Lambda (often in VPC)	The engine; must be idempotent
`Step`	Which of the four phases is running	Invocation event	Your dispatcher switches on it
`masterarn`	Pointer to the elevated master secret	Secret JSON	Required for alternating-user
Single-user	Rotate the same DB user	Rotation strategy	Brief invalid-credential window
Alternating-user	Swap between two DB users	Rotation strategy	No invalid window (production-grade)
SM VPC endpoint	Interface endpoint to the SM API	Private subnets	Lets a VPC Lambda call the API
Resource policy	Who else may read this secret	On the secret	Cross-account / cross-principal grant
CMK	Customer-managed KMS key	KMS	The only key shareable cross-account

1. The versioning and staging-label model

Everything in rotation hinges on staging labels. A secret holds one or more versions, each carrying labels. Three are reserved and load-bearing:

AWSCURRENT — the version applications get by default when they call GetSecretValue without specifying a version.
AWSPENDING — the in-flight version being created and tested during rotation. It does not exist between rotations.
AWSPREVIOUS — automatically applied to the version that was AWSCURRENT once rotation finishes, so you can roll back one generation.

A staging label points to exactly one version at a time, although one version can carry several labels. Rotation is fundamentally the act of creating a new version labelled AWSPENDING, proving it works, then atomically moving AWSCURRENT onto it. Applications that always read AWSCURRENT (the default) never see the pending value until cutover completes. That is the whole reason rotation can be zero-downtime: the new credential is fully provisioned and tested before anything points production at it.

Mental model: rotation never edits a secret in place. It writes a new version, parks it under AWSPENDING, and only relabels AWSCURRENT at the very end. If you ever find yourself “updating” a secret value during rotation, you are off the rails.
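
The relabel model can be sketched in a few lines of plain Python. This is a toy, in-memory illustration only: `SecretModel` and its methods are hypothetical names, not part of the Secrets Manager API, and the real service enforces these rules server-side.

```python
class SecretModel:
    """Toy in-memory model of staging labels; not the Secrets Manager API."""

    def __init__(self, initial_version, value):
        self.values = {initial_version: value}         # version id -> value, never mutated
        self.stages = {"AWSCURRENT": initial_version}  # label -> version id

    def put_pending(self, version, value):
        if version in self.values:
            if self.values[version] != value:
                raise ValueError("versions are immutable once written")
            return  # same token, same content: a harmless retry
        self.values[version] = value
        self.stages["AWSPENDING"] = version

    def promote(self, version):
        if self.stages["AWSCURRENT"] == version:
            return  # already promoted (retry)
        if self.stages.get("AWSPENDING") != version:
            raise ValueError("only the pending version can be promoted")
        # The relabel: nothing is rewritten, only the pointers move.
        self.stages["AWSPREVIOUS"] = self.stages["AWSCURRENT"]
        self.stages["AWSCURRENT"] = version
        del self.stages["AWSPENDING"]

    def read(self, stage="AWSCURRENT"):
        return self.values[self.stages[stage]]


secret = SecretModel("v1", "old-password")
secret.put_pending("v2", "new-password")
value_during_rotation = secret.read()   # consumers still get v1 here
secret.promote("v2")
value_after_cutover = secret.read()
```

Nothing in `promote` touches `values`; only `stages` changes, which is the point of the model.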

The three reserved labels, what moves them, and what each means for callers:

Staging label	Set by	Points to	Exists between rotations?	What a caller gets
`AWSCURRENT`	`finishSecret` (relabel)	The live, in-use version	Yes (always exactly one)	Default value from `GetSecretValue`
`AWSPENDING`	`createSecret` (`PutSecretValue`)	The new version under test	No (transient)	Only if explicitly requested by stage
`AWSPREVIOUS`	Auto, when `AWSCURRENT` moves	The prior current version	Yes after first rotation	One generation back, for rollback
Custom label	You, via `UpdateSecretVersionStage`	Any version you choose	Yes if you set it	Whatever you pin it to (avoid for apps)

The version-stage invariants you must never violate — break one and rotation desyncs:

Invariant	Why it holds	What violating it causes
Exactly one version holds `AWSCURRENT`	Callers need one unambiguous live value	Apps read an indeterminate credential
`AWSPENDING` is removed once promoted	It only labels the in-flight version	Lingering pending blocks the next cycle
A version is immutable once written	A version is identified by its version ID (the token); `PutSecretValue` with the same token but different content is rejected	“Editing” forks state from what was applied
`AWSPREVIOUS` is one generation only	Rollback target, not history	Treating it as an archive loses older values
Labels move atomically	Cutover must be all-or-nothing	Partial moves expose a half-rotated secret

Inspect the live label-to-version mapping at any time — this is your first diagnostic command in any rotation incident:

aws secretsmanager describe-secret --secret-id prod/payments/db-app \
  --query 'VersionIdsToStages'
# Healthy: one version → ["AWSCURRENT"], one → ["AWSPREVIOUS"]; "AWSPENDING" is either
# absent or attached to the same version as "AWSCURRENT".
# "AWSPENDING" lingering on a different version means a step failed mid-cycle.
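
If you script this check, the classification could look roughly like the sketch below. It takes the `VersionIdsToStages` mapping as a plain dict. The function name and return strings are illustrative, and treating `AWSPENDING` on the same version as `AWSCURRENT` as a completed rotation is an assumption about how you want to read that state.

```python
def diagnose_stages(version_ids_to_stages):
    """Classify a VersionIdsToStages mapping (version id -> list of labels)."""
    current = [v for v, stages in version_ids_to_stages.items() if "AWSCURRENT" in stages]
    pending = [v for v, stages in version_ids_to_stages.items() if "AWSPENDING" in stages]

    if len(current) != 1:
        return "invalid: expected exactly one AWSCURRENT"
    if pending and pending != current:
        # A different version still carries AWSPENDING: a step failed mid-cycle.
        return f"stuck: AWSPENDING on {pending[0]}"
    return "healthy"


healthy = diagnose_stages({"v2": ["AWSCURRENT"], "v1": ["AWSPREVIOUS"]})
stuck = diagnose_stages({"v2": ["AWSCURRENT"], "v1": ["AWSPREVIOUS"], "v3": ["AWSPENDING"]})
```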

2. The four-step rotation Lambda

Secrets Manager drives rotation by invoking your Lambda four times, passing a Step field in the event each time. Your function is a dispatcher on that field. The contract is fixed:

{
  "SecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/app/db-AbCdEf",
  "ClientRequestToken": "uuid-of-the-new-version",
  "Step": "createSecret"
}

The ClientRequestToken is the version ID of the new (AWSPENDING) version. What each step must do, and the idempotency rule that keeps retries safe:

Step	Responsibility	Idempotency rule	Failure if you skip it
`createSecret`	Generate the new credential and store it as `AWSPENDING`.	If `AWSPENDING` already exists, do nothing.	Each retry regenerates → desync vs `setSecret`
`setSecret`	Apply the pending credential to the backing service (e.g. `ALTER USER`).	Must tolerate being run twice.	Double-apply or partial apply on retry
`testSecret`	Connect/authenticate using the pending credential.	Read-only validation.	A broken credential gets promoted
`finishSecret`	Move `AWSCURRENT` to the pending version.	Return early if already promoted.	Cutover repeats or strands the label

The four steps as a state machine — what is true before and after each, and who acts:

Phase	Before	Action	After	Actor
Start	One version: `AWSCURRENT`	SM allocates a new version ID	Token reserved for `AWSPENDING`	Secrets Manager
`createSecret`	No `AWSPENDING`	Generate value, `PutSecretValue`	New version labelled `AWSPENDING`	Your Lambda
`setSecret`	Pending exists, not yet live in DB	Apply to backing service	DB accepts the pending credential	Your Lambda
`testSecret`	Pending applied	Authenticate with pending	Pending proven good	Your Lambda
`finishSecret`	Pending good	Move `AWSCURRENT` → pending	Pending becomes current; old → previous	Your Lambda

The most common bug is non-idempotent steps. Secrets Manager retries; if createSecret blindly generates a new password every invocation, you desync the stored secret from what setSecret applied. Always check for an existing AWSPENDING first.

import json


def create_secret(service_client, arn, token):
    # Source of truth is AWSCURRENT; we derive the new value from it.
    current = service_client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")

    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # Pending already staged on a retry — do not regenerate.
        return
    except service_client.exceptions.ResourceNotFoundException:
        pass

    secret = json.loads(current["SecretString"])
    secret["password"] = service_client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]

    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )
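
One way to convince yourself the guard works is to run the step twice against an in-memory stand-in. The sketch below is an assumption-laden test double: `FakeSecretsManager` is hypothetical and mimics only the three calls `create_secret` makes, with none of the real service's validation or encryption.

```python
import json
import secrets
from types import SimpleNamespace


class ResourceNotFoundException(Exception):
    pass


class FakeSecretsManager:
    """In-memory stand-in exposing only the calls create_secret uses."""

    exceptions = SimpleNamespace(ResourceNotFoundException=ResourceNotFoundException)

    def __init__(self, current_version, secret_string):
        self.versions = {current_version: secret_string}
        self.stages = {"AWSCURRENT": current_version}

    def get_secret_value(self, SecretId, VersionStage, VersionId=None):
        version = self.stages.get(VersionStage)
        if version is None or (VersionId is not None and VersionId != version):
            raise ResourceNotFoundException(VersionStage)
        return {"SecretString": self.versions[version], "VersionId": version}

    def get_random_password(self, **kwargs):
        return {"RandomPassword": secrets.token_hex(16)}

    def put_secret_value(self, SecretId, ClientRequestToken, SecretString, VersionStages):
        self.versions[ClientRequestToken] = SecretString
        for stage in VersionStages:
            self.stages[stage] = ClientRequestToken


def create_secret(client, arn, token):
    current = client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")
    try:
        client.get_secret_value(SecretId=arn, VersionId=token, VersionStage="AWSPENDING")
        return  # pending already staged on a retry: do not regenerate
    except client.exceptions.ResourceNotFoundException:
        pass
    secret = json.loads(current["SecretString"])
    secret["password"] = client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]
    client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )


fake = FakeSecretsManager("v1", json.dumps({"username": "app_user", "password": "old"}))
create_secret(fake, "arn:example", "v2")
first_attempt = fake.versions["v2"]
create_secret(fake, "arn:example", "v2")  # simulated Secrets Manager retry
second_attempt = fake.versions["v2"]
```

If the early `return` is removed, the second call generates a different password and the two attempts no longer match, which is exactly the desync described above.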

The finishSecret step is equally precise — it is a relabel, not a write. Find the version currently holding AWSCURRENT (returning early if it is already token, i.e. a retry), then move the label:

def finish_secret(service_client, arn, token, current_version):
    if current_version == token:
        return  # Already promoted (retry).
    service_client.update_secret_version_stage(
        SecretId=arn,
        VersionStage="AWSCURRENT",
        MoveToVersionId=token,
        RemoveFromVersionId=current_version,
    )

For most database engines you do not write this yourself — AWS publishes managed rotation functions as serverless application templates. The managed functions you should reach for before writing custom code:

Backing service	Managed function family	Strategies offered	Write custom instead?
RDS/Aurora PostgreSQL	`SecretsManagerRDSPostgreSQLRotation…`	Single-user, MultiUser	No — use managed
RDS/Aurora MySQL/MariaDB	`SecretsManagerRDSMySQLRotation…`	Single-user, MultiUser	No — use managed
RDS SQL Server	`SecretsManagerRDSSQLServerRotation…`	Single-user, MultiUser	No — use managed
RDS Oracle	`SecretsManagerRDSOracleRotation…`	Single-user, MultiUser	No — use managed
Amazon Redshift	`SecretsManagerRedshiftRotation…`	Single-user, MultiUser	No — use managed
DocumentDB	`SecretsManagerMongoDBRotation…`	Single-user, MultiUser	No — use managed
Generic / third-party API	`SecretsManagerRotationTemplate`	You implement	Yes — start from the template
OS / SSH / non-API target	(none)	—	Yes — full custom

The dispatcher skeleton that ties the four steps together — note the explicit guard that rotation is even enabled and that the version is pending:

import boto3


def lambda_handler(event, context):
    arn, token, step = event["SecretId"], event["ClientRequestToken"], event["Step"]
    client = boto3.client("secretsmanager")

    md = client.describe_secret(SecretId=arn)
    if not md.get("RotationEnabled"):
        raise ValueError(f"Rotation not enabled for {arn}")
    versions = md["VersionIdsToStages"]
    if token not in versions:
        raise ValueError(f"Version {token} has no stage for {arn}")
    if "AWSCURRENT" in versions[token]:
        return  # Already current: nothing to do.
    if "AWSPENDING" not in versions[token]:
        # The guard the text promises: refuse to act on a version that is not pending.
        raise ValueError(f"Version {token} is not staged as AWSPENDING for {arn}")

    current_version = next(v for v, s in versions.items() if "AWSCURRENT" in s)
    {
        "createSecret": lambda: create_secret(client, arn, token),
        "setSecret":    lambda: set_secret(client, arn, token),
        "testSecret":   lambda: test_secret(client, arn, token),
        "finishSecret": lambda: finish_secret(client, arn, token, current_version),
    }[step]()

3. RDS rotation: single-user vs alternating-user

For RDS, RDS Proxy, DocumentDB, and Redshift, you choose a rotation strategy, and the choice has real operational consequences.

Single-user rotation changes the password on the same database user every cycle. Simple, but hazardous: between setSecret (which runs ALTER USER ... PASSWORD) and the moment your application picks up the new AWSCURRENT, any connection attempt with the old password fails. If your app caches the secret and only refreshes on auth failure, you get a brief window of errors every rotation. Acceptable for low-traffic or fault-tolerant apps; not for a hot path.

Alternating-user rotation (the “multi-user” strategy) is the production-grade choice. It clones the existing user into a second account and alternates between the two each cycle. Cycle N uses app_user; cycle N+1 sets a new password on app_user_clone, tests it, swaps AWSCURRENT to it, and leaves app_user and its password untouched. Because rotation only ever changes the user that applications are not currently using, a consumer still holding the previous credential keeps authenticating until the next cycle, and there is no window where a credential in use is invalid.
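
The difference between the two strategies can be made concrete with a toy simulation. The sketch below models the database as a dict of username to password and a consumer that is still holding the pre-rotation credential. All function names are illustrative, and the `_clone` suffix simply mirrors the `app_user` / `app_user_clone` naming used above.

```python
def alternate_username(username, suffix="_clone"):
    """Flip between 'app_user' and 'app_user_clone'."""
    if username.endswith(suffix):
        return username[: -len(suffix)]
    return username + suffix


def rotate_single_user(db, current, new_password):
    # The password of the user in use is overwritten in place.
    db[current["username"]] = new_password
    return {"username": current["username"], "password": new_password}


def rotate_alternating_user(db, current, new_password):
    # Only the idle user changes; the user in use is left alone.
    other = alternate_username(current["username"])
    db[other] = new_password
    return {"username": other, "password": new_password}


def can_login(db, cred):
    return db.get(cred["username"]) == cred["password"]


cached = {"username": "app_user", "password": "p1"}  # what a lagging consumer holds

single_db = {"app_user": "p1"}
single_new = rotate_single_user(single_db, cached, "p2")
single_cached_ok = can_login(single_db, cached)   # stale cache is now rejected

alt_db = {"app_user": "p1", "app_user_clone": "p0"}
alt_new = rotate_alternating_user(alt_db, cached, "p2")
alt_cached_ok = can_login(alt_db, cached)         # stale cache still works
```

In the alternating case the stale credential only stops working one cycle later, when `app_user` becomes the idle user and is rotated in turn.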

Strategy	Users	Rotation-window failures	Setup requirement	Best for
Single-user	1	Possible (brief)	App user can `ALTER` itself	Low-traffic, fault-tolerant apps
Alternating-user	2	None if app respects `AWSCURRENT`	A superuser secret to clone the user	Hot paths, fleets, strict SLAs

A fuller side-by-side, because the trade-offs go beyond the failure window:

Dimension	Single-user	Alternating-user
DB users required	1	2 (original + clone)
Master/superuser secret	Not required	Required (`masterarn`)
Invalid-credential window	Yes (until cache refresh)	None
Privileges the rotator needs	`ALTER USER` on self	`CREATE/ALTER USER`, clone grants
Behaviour on app cache lag	Auth errors until refresh	Old user still valid → no errors
Rollback safety	One generation	One generation + prior user live
Operational complexity	Lower	Higher (two users to reason about)
Recommended for hot paths	No	Yes

Alternating-user requires a second secret — the master/superuser secret — referenced via masterarn, because cloning a user and granting its privileges needs elevated rights the application user does not have. The rotation Lambda authenticates with the master secret to provision the clone, then rotates the application secret.

# Enable alternating-user rotation on an RDS secret using the AWS-managed
# Postgres rotation function, every 30 days.
aws secretsmanager rotate-secret \
  --secret-id prod/payments/db-app \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRDSPostgreSQLRotationMultiUser \
  --rotation-rules '{"ScheduleExpression": "rate(30 days)", "Duration": "2h"}'

The strategy is set by which rotation function you attach, and the application’s secret must carry the fields that function expects. A multi-user Postgres secret looks like this; every field below is load-bearing for the managed function:

{
  "engine": "postgres",
  "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
  "username": "app_user",
  "password": "current-password",
  "dbname": "payments",
  "port": 5432,
  "masterarn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-master-XyZ"
}

The required and optional keys in an RDS secret JSON, and what the managed function does with each:

Key	Required?	Used by	Notes / gotcha
`engine`	Yes	Function (driver select)	`postgres`, `mysql`, `oracle`, `sqlserver`, etc.
`host`	Yes	Connect	Cluster endpoint; use the writer for DDL
`port`	Yes	Connect	5432 PG, 3306 MySQL, 1521 Oracle, 1433 MSSQL
`username`	Yes	`ALTER`/auth	The rotating app user
`password`	Yes	Auth	The function overwrites this each cycle
`dbname`	Often	Connect	Some engines require it to authenticate
`masterarn`	Alternating only	Clone the user	Required by the MultiUser functions; the strategy itself is chosen by which function you attach
`dbInstanceIdentifier`	Optional	Resolve host	Alternative to `host` for some templates
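
A pre-flight check on the secret JSON can catch a missing key before the first rotation attempt does. The sketch below takes the "Required?" column above at face value. The function name and the strategy strings are illustrative, and a real managed function may accept defaults that this check rejects.

```python
import json

BASE_REQUIRED = ("engine", "host", "port", "username", "password")


def missing_rds_secret_keys(secret, strategy):
    """Return the required keys that are absent or empty for the chosen strategy."""
    if strategy not in ("single-user", "alternating-user"):
        raise ValueError(f"Unknown strategy: {strategy!r}")
    required = list(BASE_REQUIRED)
    if strategy == "alternating-user":
        required.append("masterarn")  # needed to clone the user
    return [key for key in required if not secret.get(key)]


secret = json.loads("""
{
  "engine": "postgres",
  "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
  "username": "app_user",
  "password": "current-password",
  "dbname": "payments",
  "port": 5432
}
""")

missing_for_single = missing_rds_secret_keys(secret, "single-user")
missing_for_alternating = missing_rds_secret_keys(secret, "alternating-user")
```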

The schedule controls in --rotation-rules, and how each behaves:

Field	Meaning	Example	Notes
`ScheduleExpression`	When rotation fires	`rate(30 days)` / `cron(0 3 1 * ? *)`	Shortest supported interval is four hours (`rate(4 hours)`)
`Duration`	Rotation window length	`"2h"`	SM starts within this window, not exactly on the dot
`AutomaticallyAfterDays`	Legacy interval (days)	`30`	Superseded by `ScheduleExpression`
(manual trigger)	`rotate-secret` with no rules	—	Runs all four steps immediately, once
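
From Python, the same schedule is passed to boto3's `rotate_secret` as a `RotationRules` dict. The helper below is a minimal sketch that only assembles and sanity-checks the keyword arguments; `build_rotate_request` is a hypothetical name, and its validation is deliberately loose rather than a copy of the service's own rules.

```python
import re


def build_rotate_request(secret_id, lambda_arn, schedule, window_hours,
                         rotate_immediately=False):
    """Assemble keyword arguments for secretsmanager.rotate_secret."""
    if not re.fullmatch(r"(rate|cron)\(.+\)", schedule):
        raise ValueError(f"Not a rate() or cron() expression: {schedule!r}")
    if window_hours < 1:
        raise ValueError("window_hours must be at least 1")
    return {
        "SecretId": secret_id,
        "RotationLambdaARN": lambda_arn,
        "RotationRules": {
            "ScheduleExpression": schedule,
            "Duration": f"{window_hours}h",
        },
        # False defers the first rotation to the next scheduled window.
        "RotateImmediately": rotate_immediately,
    }


request = build_rotate_request(
    "prod/payments/db-app",
    "arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRDSPostgreSQLRotationMultiUser",
    "rate(30 days)",
    window_hours=2,
)
# With credentials configured, this would be applied as:
#   boto3.client("secretsmanager").rotate_secret(**request)
```

Setting `RotateImmediately` to `False` is a design choice for first-time enablement: it avoids running all four steps against production the moment the schedule is attached.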

Put RDS Proxy in front of this. The proxy keeps a warm connection pool and integrates natively with Secrets Manager, so when AWSCURRENT flips it picks up the new credential without each application instance re-reading the secret. Combined with alternating-user rotation, cutover becomes a non-event — covered in depth in RDS Proxy: Connection Pooling, Failover, IAM Auth.

4. VPC, networking, and KMS permissions

This is where rotation breaks in practice, and the failure is always the same flavour: the Lambda times out with no useful error. The function has to reach two endpoints — the database and the Secrets Manager API — and both paths must be open.

If your database is in private subnets, the rotation Lambda must attach to the same VPC with security-group access to the DB port. But a VPC Lambda loses default internet egress and still needs to call Secrets Manager. The two ways to give it API access, side by side:

Egress option	How it works	Cost	When to choose	Gotcha
Secrets Manager interface endpoint	`com.amazonaws.<region>.secretsmanager` ENI in the Lambda subnets	~hourly/ENI + per-GB	Private workloads (the right answer)	Needs `private_dns_enabled` and an SG allowing 443
NAT gateway	Route 0.0.0.0/0 → NAT → IGW	Hourly + per-GB egress	You already run NAT for other reasons	Pays egress; traffic leaves the VPC for the public Secrets Manager endpoint
(none — Lambda not in VPC)	Public Lambda reaches SM directly	None extra	DB is publicly reachable (rare/bad)	Can’t reach a private DB at all

Everything the rotation Lambda depends on at run time, and what you see when each is blocked:

Target	Reached over	Requires	If blocked you see
RDS/Aurora (DB port)	VPC, in-subnet	SG ingress on 5432/3306 from rotator SG	Timeout at `setSecret`/`testSecret`
Secrets Manager API	Interface endpoint or NAT	Endpoint + SG 443, or NAT route	Timeout before any step logs
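KMS (decrypt the secret)	Called by Secrets Manager on the caller’s behalf	`kms:Decrypt` / `kms:GenerateDataKey` for the role on the CMK; no KMS network path from the Lambda	`AccessDenied` on `GetSecretValue` / `PutSecretValue`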
(alternating) Master secret read	Secrets Manager API	Same as SM API + read on master	`setSecret` cannot clone the user

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.sm_endpoint.id]
  private_dns_enabled = true
}

# The rotation Lambda's SG must be allowed inbound on the DB port.
resource "aws_security_group_rule" "db_from_rotator" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds.id
  source_security_group_id = aws_security_group.rotation_lambda.id
}
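
Because both failures present as the same bare timeout, it can help to probe each path separately from inside the rotator's subnets, for example from a throwaway Lambda in the same VPC and security group. The sketch below only tests TCP reachability. The helper names are illustrative, and it assumes the regional endpoint hostname `secretsmanager.<region>.amazonaws.com`, which private DNS resolves to the interface endpoint when one exists.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and DNS failures
        return False


def probe_rotation_paths(db_host, db_port, region, timeout=3.0):
    """Check the two network paths a VPC rotation Lambda needs."""
    return {
        "database": can_connect(db_host, db_port, timeout),
        "secretsmanager": can_connect(
            f"secretsmanager.{region}.amazonaws.com", 443, timeout
        ),
    }
```

A result of `{"database": True, "secretsmanager": False}` points at the missing endpoint or NAT route; the reverse points at the database security group. A successful TCP connect says nothing about IAM or KMS permissions.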

For KMS: if the secret is encrypted with a customer-managed key (CMK) rather than the AWS-managed aws/secretsmanager key — and at scale it should be, for cross-account and auditability reasons — the rotation Lambda’s execution role needs kms:Decrypt and kms:GenerateDataKey on that key, and the key policy must permit Secrets Manager. The role also needs the standard rotation permissions plus secretsmanager:GetRandomPassword.

The exact IAM actions the rotation execution role needs, and why each is there:

Action	On resource	Why the rotator needs it
`secretsmanager:DescribeSecret`	The secret	Read rotation state and version stages
`secretsmanager:GetSecretValue`	The secret (+ master)	Read `AWSCURRENT`/`AWSPENDING`, master to clone
`secretsmanager:PutSecretValue`	The secret	Write the new `AWSPENDING` version
`secretsmanager:UpdateSecretVersionStage`	The secret	Move `AWSCURRENT` at `finishSecret`
`secretsmanager:GetRandomPassword`	`*`	Generate the new password
`kms:Decrypt`	The CMK	Decrypt the stored secret
`kms:GenerateDataKey`	The CMK	Encrypt the new version
`ec2:CreateNetworkInterface` etc.	`*`	VPC Lambda ENI management (managed policy)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:DescribeSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
        "secretsmanager:UpdateSecretVersionStage"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/rotation": "managed" }
      }
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetRandomPassword",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/<cmk-id>"
    }
  ]
}

Secrets Manager invokes your function on your behalf, so the Lambda’s resource policy must grant lambda:InvokeFunction to secretsmanager.amazonaws.com. The AWS console wires this automatically; in IaC you add it explicitly:

resource "aws_lambda_permission" "allow_secretsmanager" {
  statement_id  = "AllowSecretsManagerInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.rotator.function_name
  principal     = "secretsmanager.amazonaws.com"
}

The four policies rotation touches; confusing them is a common time-sink, so keep them straight:

Policy	Attached to	Direction	Grants
Execution role (identity)	The Lambda	What the Lambda may do	SM read/write, KMS, ENI
Lambda resource policy	The Lambda	Who may invoke it	`secretsmanager.amazonaws.com` invoke
KMS key policy	The CMK	Who may use the key	SM service + rotator + (cross-acct) consumer
Secret resource policy	The secret	Who else may read it	Consumer account (cross-account)

5. A custom rotator for non-RDS credentials

For third-party API keys, a MongoDB Atlas user, or an internal service token, there is no managed function — you implement the four steps against the provider’s API. The shape is identical; only setSecret and testSecret change. The key discipline: many providers do not let you set a key value — they generate and return it. There, createSecret and setSecret collapse — you call the provider’s “create credential” API once, store the result as AWSPENDING, and make setSecret a near no-op. Clean up the old credential only in finishSecret, never before — deleting the old key before the new one is promoted is how you cause an outage.

The two provider archetypes and how the four steps adapt to each:

Step	Provider accepts a value (DB-like)	Provider issues the value (API-key-like)
`createSecret`	Generate value, store `AWSPENDING`	Call provider “create key”, store returned value as `AWSPENDING`
`setSecret`	Apply value to the service (`ALTER USER`)	Near no-op (key already live at provider)
`testSecret`	Authenticate with pending value	Make an authenticated call with pending key
`finishSecret`	Move `AWSCURRENT`	Move `AWSCURRENT`, then delete the old provider key

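import json


def set_secret(service_client, arn, token):
    # provider_key_exists, provider_create_key and provider_authenticated_call are
    # placeholders for your provider's client; they are not defined here.
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Value-accepting archetype. When the provider issues the key itself, the key
    # is already live after createSecret and this step does nothing.
    # Idempotent: only create at the provider if this pending key is not live yet.
    if not provider_key_exists(pending["api_key_id"]):
        provider_create_key(pending["api_key_id"], pending["api_key_secret"])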

def test_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    resp = provider_authenticated_call(pending["api_key_secret"])
    if resp.status_code != 200:
        raise ValueError("Pending credential failed validation")

Custom-rotator pitfalls that cause real outages, and the discipline that avoids each:

Pitfall	What happens	Discipline
Delete old key in `setSecret`	Old key dies before promotion → outage	Delete only in `finishSecret`, after cutover
Non-idempotent provider create	Retry creates duplicate keys	Check `provider_key_exists` before create
No `testSecret` call	Broken key gets promoted	Always make a real authenticated call
Assuming you can set the key value	Provider rejects/ignores it	Store what the provider returns
Hard provider rate limits	Create/delete throttled mid-cycle	Back off; keep the window (`Duration`) generous
Leaking the old key forever	Credential sprawl	`finishSecret` revokes the prior credential
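
The first and last rows of the table meet in finishSecret. The following is a minimal sketch, not a definitive implementation. It assumes the secret JSON carries an api_key_id field, as in the set_secret example above, and that revoke_provider_key is a hypothetical, idempotent callable you supply for your provider (it is not part of any AWS SDK). The order is the point: move AWSCURRENT first, revoke second, and make a retry safe.

```python
import json


def _version_with_stage(metadata, stage):
    """Return the version ID that carries `stage`, or None."""
    for version_id, stages in metadata.get("VersionIdsToStages", {}).items():
        if stage in stages:
            return version_id
    return None


def _key_id(service_client, arn, version_id):
    raw = service_client.get_secret_value(SecretId=arn, VersionId=version_id)
    return json.loads(raw["SecretString"])["api_key_id"]


def finish_secret(service_client, arn, token, revoke_provider_key):
    metadata = service_client.describe_secret(SecretId=arn)
    current = _version_with_stage(metadata, "AWSCURRENT")

    if current == token:
        # Retry after the label already moved: the old version is AWSPREVIOUS.
        old_version = _version_with_stage(metadata, "AWSPREVIOUS")
    else:
        # Promote first; the old provider key is still valid at this point.
        kwargs = {
            "SecretId": arn,
            "VersionStage": "AWSCURRENT",
            "MoveToVersionId": token,
        }
        if current is not None:
            kwargs["RemoveFromVersionId"] = current
        service_client.update_secret_version_stage(**kwargs)
        old_version = current

    if old_version is None:
        return  # first rotation: nothing to revoke

    old_key = _key_id(service_client, arn, old_version)
    new_key = _key_id(service_client, arn, token)
    if old_key != new_key:
        # Hypothetical provider call; it must tolerate an already-revoked key.
        revoke_provider_key(old_key)
```

Because the revoke runs on the retry path too, a failure between the label move and the provider call does not leave the old key alive forever.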

Deploy and schedule a custom rotator the same way, pointing --rotation-lambda-arn at your function. Validate one step at a time before enabling the schedule — rotate-secret runs all four immediately, and you do not want to discover a broken finishSecret in production.

6. Cross-account sharing

A central account often owns secrets that spoke-account workloads consume: a shared database credential, a partner API key. Three grants must all be in place or the read fails with AccessDenied: a resource policy on the secret, KMS permissions on the encrypting key, and, on the consumer side, the principal’s own identity policy. Three independent allows, evaluated in three different places.

The three grants cross-account sharing requires, where each lives, and the symptom when it is the one missing:

Grant	Lives in	Account	Symptom if missing
Secret resource policy	On the secret	Owner	`AccessDenied` on `GetSecretValue`
KMS key policy `Decrypt`	On the CMK	Owner	`AccessDenied` on `kms:Decrypt` (can’t decrypt)
Identity policy	On the principal	Consumer	`AccessDenied` even though resource policy allows
CMK (not `aws/secretsmanager`)	KMS choice	Owner	Managed key can’t be shared at all

The resource policy grants the consumer account access to the secret:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "*"
    }
  ]
}

The trap: you cannot use the AWS-managed aws/secretsmanager key for cross-account access — its key policy is not editable to permit another account. Encrypt cross-account secrets with a customer-managed CMK and grant the consumer account kms:Decrypt in the key policy:

{
  "Sid": "AllowConsumerDecrypt",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
    }
  }
}
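
The two documents above must name the same principal, and the key condition must name the Region the secret lives in, so it can help to generate them together. A small sketch, assuming a hypothetical helper name build_sharing_policies; the statement shapes are the ones shown above and nothing here calls AWS:

```python
import json


def build_sharing_policies(consumer_principal_arn, region):
    """Return (secret_resource_policy, kms_key_statement) for one consumer."""
    principal = {"AWS": consumer_principal_arn}
    secret_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": "*",
            }
        ],
    }
    kms_statement = {
        "Sid": "AllowConsumerDecrypt",
        "Effect": "Allow",
        "Principal": principal,
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                # Decrypt is only usable through Secrets Manager in this Region.
                "kms:ViaService": f"secretsmanager.{region}.amazonaws.com"
            }
        },
    }
    return secret_policy, kms_statement


secret_policy, kms_statement = build_sharing_policies(
    "arn:aws:iam::444455556666:root", "us-east-1"
)
print(json.dumps(secret_policy, indent=2))
print(json.dumps(kms_statement, indent=2))
```

The secret policy is a complete document. The KMS output is a single statement that has to be merged into the key's existing policy, since writing a key policy replaces the whole document.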

Why aws/secretsmanager cannot go cross-account, contrasted with a CMK — this single distinction is the most common cross-account failure:

Property	`aws/secretsmanager` (managed)	Customer-managed CMK
Key policy editable	No	Yes
Cross-account grant possible	No	Yes
Cost	Free	~$1/key/month + API
Rotation of the key itself	AWS-managed (automatic)	Optional; you enable and schedule it
Auditability granularity	Coarse	Per-key CloudTrail
Use for shared secrets	Never	Always

On the consumer side, the IAM principal still needs its own identity-based policy allowing secretsmanager:GetSecretValue on the secret ARN and kms:Decrypt on the key ARN. Resource policy and identity policy must both allow, since each account authorizes independently. Check the consumer's identity-policy half with the policy simulator before you trust it. Run this way, the simulator does not load the owner account's resource policy or key policy, so a real GetSecretValue call from the role remains the end-to-end test:

# From the CONSUMER account: does this role's identity policy allow the read?
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::444455556666:role/app-role \
  --action-names secretsmanager:GetSecretValue \
  --resource-arns arn:aws:secretsmanager:us-east-1:111122223333:secret:shared/partner-api-XyZ \
  --query 'EvaluationResults[].{action:EvalActionName, decision:EvalDecision}'
# Repeat with --action-names kms:Decrypt and the CMK's ARN in --resource-arns;
# simulating kms:Decrypt against the secret ARN tests nothing useful.

Rotation keeps running in the owning account; consumers always resolve AWSCURRENT and so follow rotation automatically. The deeper cross-account mechanics (external ID, confused-deputy, session policies) are in IAM Cross-Account Roles, External ID, Confused Deputy; the KMS half is in AWS KMS Encryption Deep Dive.
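
On the consumer side the read itself is ordinary, with one cross-account detail: the secret has to be addressed by its full ARN, because a bare name is resolved in the caller's own account. A minimal sketch, assuming client is a boto3 Secrets Manager client created with the consumer role's credentials; read_shared_secret is a hypothetical helper name:

```python
import json


def read_shared_secret(client, secret_arn):
    """Read the AWSCURRENT version of a secret owned by another account."""
    if not secret_arn.startswith("arn:"):
        raise ValueError("cross-account reads need the full secret ARN, not the name")
    # No VersionId and no VersionStage: the default is AWSCURRENT, so the
    # consumer follows every rotation the owner account performs.
    response = client.get_secret_value(SecretId=secret_arn)
    return json.loads(response["SecretString"])
```

In a service this call would sit behind a cache rather than run per request.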

7. Application integration without rotation-window failures

The cache makes or breaks the consumer experience. Calling GetSecretValue on every request is slow and hits API throttling. The right pattern is the AWS caching client (Java, Python, Go, .NET, or the Lambda extension), which caches AWSCURRENT in memory and refreshes on an interval.

import json

import boto3
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

client = boto3.client("secretsmanager")
cache = SecretCache(config=SecretCacheConfig(secret_refresh_interval=3600), client=client)

def get_connection():
    secret = json.loads(cache.get_secret_string("prod/payments/db-app"))
    return connect(secret)  # On auth failure, force-refresh and retry once.

The integration approaches ranked from worst to best, with the trade-off each makes:

Pattern	Latency	API cost	Rotation-safe?	Verdict
`GetSecretValue` per request	High	High (throttles)	Yes (always fresh)	Never do this
Read once at boot, never refresh	Low	Minimal	No — misses rotation	Breaks on every cycle
Caching client (TTL)	Low	Low	Yes (refreshes)	Recommended for services
Caching client + refresh-on-auth-fail	Low	Low	Yes + self-heals	Best for single-user
Lambda extension (localhost cache)	Lowest	Low	Yes	Best for serverless
RDS Proxy (proxy holds the secret)	Lowest	None (app)	Yes — cutover at proxy	Best for shared DB creds

Two rules keep you safe across rotations:

Refresh on auth failure, not only on a timer. If a DB connection is rejected, invalidate the cache entry, re-read AWSCURRENT, and retry once. With alternating-user rotation this almost never fires, but it is your safety net for single-user.
Never pin a VersionId. Read AWSCURRENT (the default). Pinning a version means you never follow rotation — the cardinal sin.
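
The two rules can be sketched without any AWS dependency. This is an illustration of the pattern, not the AWS caching client: RefreshingSecret and connect_with_retry are hypothetical names, fetch stands for whatever returns the AWSCURRENT SecretString (for example a wrapper around get_secret_value), and connect and auth_error come from your database driver.

```python
import json
import time


class RefreshingSecret:
    """TTL cache around a fetch function, with a forced refresh on demand."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch  # callable returning the SecretString of AWSCURRENT
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires = 0.0

    def get(self, force=False):
        # Re-read when forced, on first use, or once the TTL has elapsed.
        if force or self._value is None or self._clock() >= self._expires:
            self._value = json.loads(self._fetch())
            self._expires = self._clock() + self._ttl
        return self._value


def connect_with_retry(secret, connect, auth_error):
    """Connect with the cached credential; on auth failure re-read and retry once."""
    try:
        return connect(secret.get())
    except auth_error:
        # The cached credential was rotated away: bypass the TTL exactly once.
        return connect(secret.get(force=True))
```

The retry is deliberately limited to one attempt: if the freshly read AWSCURRENT is also rejected, the problem is not cache staleness and the error should surface.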

The caching-client knobs and how to reason about each:

Setting	What it does	Default	When to change
`secret_refresh_interval`	TTL before re-reading `AWSCURRENT`	3600 s	Lower for faster pickup; higher to cut API calls
`max_cache_size`	How many secrets cached	1024	Raise if a process reads many secrets
`exception_retry_delay_base`	Initial backoff delay after API errors	1 s	Tune under throttling
Force-refresh on auth fail	Re-read immediately on rejection	your code	Always implement for single-user
Version stage requested	Which label to read	`AWSCURRENT`	Leave default — never pin a version

The Lambda extension is cleanest for serverless consumers: it runs a local HTTP cache with built-in TTL, so your function makes a localhost call instead of hitting the Secrets Manager API.
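
A sketch of that localhost call from inside a function, under the assumption that the AWS Parameters and Secrets Lambda Extension layer is attached and listening on its default port 2773 (overridable through PARAMETERS_SECRETS_EXTENSION_HTTP_PORT) and that it expects the function's session token in the X-Aws-Parameters-Secrets-Token header. The helper names are hypothetical; check the port and header against the extension documentation for the layer version you deploy.

```python
import json
import os
import urllib.parse
import urllib.request


def build_extension_request(secret_id, env=os.environ):
    """Build the localhost request the extension serves from its cache."""
    port = env.get("PARAMETERS_SECRETS_EXTENSION_HTTP_PORT", "2773")
    query = urllib.parse.urlencode({"secretId": secret_id})
    url = f"http://localhost:{port}/secretsmanager/get?{query}"
    # The extension authenticates the caller with the function's session token.
    headers = {"X-Aws-Parameters-Secrets-Token": env["AWS_SESSION_TOKEN"]}
    return urllib.request.Request(url, headers=headers)


def get_secret_via_extension(secret_id):
    """Return the parsed SecretString; only works inside a Lambda with the layer."""
    request = build_extension_request(secret_id)
    with urllib.request.urlopen(request, timeout=2) as response:
        payload = json.loads(response.read())
    return json.loads(payload["SecretString"])
```

Keeping the request builder separate from the network call makes the URL and header easy to unit-test outside Lambda.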

8. Monitoring rotation health and staleness

Rotation that fails silently is worse than no rotation, because you believe you are covered. Wire three signals:

RotationFailed — Secrets Manager emits this event on failure; route it through EventBridge to SNS.
RotationSucceeded — track it to detect absence. A secret that has not rotated on schedule is the real risk, visible only by alarming on staleness.
Last-rotated age — a scheduled check comparing LastRotatedDate against the schedule, alerting on overdue secrets.

The signals to wire, what each tells you, and how to surface it:

Signal	Source	Tells you	Surface via
`RotationFailed`	SM event (CloudTrail)	A cycle broke right now	EventBridge → SNS/PagerDuty
`RotationSucceeded`	SM event (CloudTrail)	A cycle completed	Metric; alarm on absence
`LastRotatedDate` age	`describe-secret`	Secret is overdue	Scheduled Lambda / Config
Lingering `AWSPENDING`	`VersionIdsToStages`	A step failed mid-cycle	Scheduled check / dashboard
`secretsmanager-rotation-enabled-check`	AWS Config rule	Rotation not even enabled	Config dashboard / SNS
`secretsmanager-secret-periodic-rotation`	AWS Config rule	Not rotated within max age	Config dashboard / SNS

{
  "source": ["aws.secretsmanager"],
  "detail-type": ["AWS Service Event via CloudTrail"],
  "detail": {
    "eventName": ["RotationFailed"]
  }
}

For staleness, AWS Config is the durable approach. Its managed rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation flag secrets without rotation enabled or not rotated within a maximum age — far better than a custom script you will forget to maintain. Pair this with the CloudTrail/CloudWatch foundations in CloudWatch & CloudTrail Observability Deep Dive.
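
For the lingering-AWSPENDING and last-rotated rows of the signal table, the scheduled check reduces to a pure function over one describe_secret response. A minimal sketch, assuming metadata is the dict boto3 returns (with a timezone-aware LastRotatedDate) and max_age is a threshold you choose per secret; rotation_findings and the finding strings are hypothetical names:

```python
from datetime import datetime, timedelta, timezone


def rotation_findings(metadata, max_age, now=None):
    """Return a list of problems found in one describe_secret response."""
    now = now or datetime.now(timezone.utc)
    findings = []

    if not metadata.get("RotationEnabled", False):
        findings.append("rotation-disabled")

    last = metadata.get("LastRotatedDate")
    if last is None:
        findings.append("never-rotated")
    elif now - last > max_age:
        findings.append("overdue")

    # AWSPENDING on a non-current version outside a cycle means a step failed.
    for stages in metadata.get("VersionIdsToStages", {}).values():
        if "AWSPENDING" in stages and "AWSCURRENT" not in stages:
            findings.append("lingering-pending")
            break

    return findings
```

A check that happens to run while a rotation is in progress will see AWSPENDING legitimately, so schedule it away from the rotation window or alert only when the finding repeats.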

Architecture at a glance

The diagram traces a single rotation across two accounts, left to right. On the far left, the consumer account (444455556666) runs the app — or, better, RDS Proxy — which only ever reads AWSCURRENT through the caching client, and a consumer IAM role that must carry its own GetSecretValue + kms:Decrypt grant. Next is the owner account (111122223333), which holds the secret (with its AWSCURRENT/AWSPENDING versions), the resource policy that grants the consumer, and the customer-managed CMK — explicitly not aws/secretsmanager, because the managed key cannot be shared. The third zone is the rotation tier living inside the database VPC: the four-step rotation Lambda and the Secrets Manager interface VPC endpoint that lets that VPC-bound Lambda reach the SM API at all. The right-most zone is the data tier in private subnets: the Aurora PostgreSQL cluster on 5432 with its two alternating users, and the elevated master secret whose superuser clones the application user.

Follow the flows: the consumer calls GetSecretValue and receives whatever AWSCURRENT points at; Secrets Manager invokes the rotation Lambda through its four steps; the Lambda reaches the database on 5432 to ALTER USER, then writes the new version back with PutSecretValue. The five numbered badges sit on exactly the hops that stall in practice — the Lambda losing its route to the DB (1), the missing SM endpoint (2), the master secret needed to clone (3), the un-shareable managed key (4), and the consumer’s missing identity grant (5) — and the legend narrates each as symptom, how to confirm, and the fix. Read the badges as the failure map laid directly over the architecture.

Real-world scenario

A payments platform team — call them NorthPay — ran 40+ microservices against an Aurora PostgreSQL cluster, all sharing one application credential rotated single-user every 7 days. Every rotation produced a 20-to-90 second spike of 500s as services hit auth failures and slowly refreshed their caches. The on-call playbook literally said “ignore the Thursday 02:00 error spike” — which is exactly how a real incident later got ignored for eleven minutes, because operators had been trained to wave off the Thursday alarm.

The constraints were hard: they could not coordinate a synchronized restart of 40 services, and the payments hot path’s error budget (a 99.95% SLO, ~21 minutes/month) was being burned by the rotation spike alone. Four cycles a month at up to 90 seconds each is up to 6 minutes, more than a quarter of the entire budget, spent on a self-inflicted event. The fix had two moves. First, they switched the secret to alternating-user rotation with a master secret, so each cycle changed the password of the idle user only and the credential that services had cached stayed valid until they refreshed, which eliminated the invalid-credential window. Second, they put RDS Proxy in front of the cluster so cutover happened once, at the proxy, instead of independently in 40 connection pools.

aws rds create-db-proxy \
  --db-proxy-name payments-proxy \
  --engine-family POSTGRESQL \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-app-XyZ","IAMAuth":"DISABLED"}]' \
  --role-arn arn:aws:iam::111122223333:role/payments-proxy-role \
  --vpc-subnet-ids subnet-aaa subnet-bbb

There was a second-order problem the team only found in staging: the rotation Lambda had originally been deployed outside the VPC (it had worked because the dev database was publicly reachable), and moving it into the production VPC immediately broke it with a bare timeout. The cause was the classic missing Secrets Manager interface endpoint — the VPC Lambda could now reach the database but had lost its path to the SM API. Adding com.amazonaws.us-east-1.secretsmanager with private DNS in the Lambda’s subnets fixed it in one change. They also moved the secret off aws/secretsmanager onto a CMK in the same motion, because a sibling analytics account needed read access and the managed key could never grant it.

The before/after, measured over a quarter:

Metric	Before (single-user, no proxy)	After (alternating-user + proxy)
Rotation-window 5xx	20–90 s spike every cycle	Zero across the quarter
Error budget spent on rotation	Up to ~30% of the monthly error budget	~0%
Services needing a restart on rotation	Up to 40 (uncoordinated)	0 (cutover at proxy)
Cross-account analytics read	Impossible (`aws/secretsmanager` key)	Working (CMK grant)
“Ignore the Thursday spike” runbook line	Present	Deleted
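
The error-budget row can be checked from the figures in the scenario: a 99.95% SLO, four rotation cycles a month, and a 20 to 90 second spike per cycle. The sketch below assumes a 30-day month; the function names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month


def error_budget_minutes(slo):
    """Minutes of allowed unavailability per month for a given SLO."""
    return (1 - slo) * MINUTES_PER_MONTH


def rotation_burn_fraction(slo, cycles_per_month, seconds_per_cycle):
    """Share of the monthly error budget consumed by rotation spikes."""
    burned_minutes = cycles_per_month * seconds_per_cycle / 60
    return burned_minutes / error_budget_minutes(slo)


budget = error_budget_minutes(0.9995)
best = rotation_burn_fraction(0.9995, cycles_per_month=4, seconds_per_cycle=20)
worst = rotation_burn_fraction(0.9995, cycles_per_month=4, seconds_per_cycle=90)
print(f"budget: {budget:.1f} min/month")
print(f"rotation burn: {best:.0%} to {worst:.0%} of the budget")
```

The worst case lands a little under 30% of the budget, which is the figure the table rounds to.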

After the change, rotation-window errors went to zero, and the “ignore the Thursday spike” line was deleted from the runbook — the real win, because a runbook that trains operators to ignore alerts is a latent incident.

Advantages and disadvantages

Automatic rotation is the right default for any credential that can rotate, but it is not free of cost or risk. The honest trade-off:

Advantages	Disadvantages
Short-lived credentials shrink blast radius	Adds a moving part (the Lambda) that can fail
Zero-downtime cutover (alternating-user)	Networking/KMS wiring is unforgiving when wrong
Consumers follow `AWSCURRENT` automatically	Silent failure is possible without monitoring
Managed functions cover most databases	Custom targets need real engineering
Native audit via CloudTrail	Per-secret + per-API + CMK cost at scale
Cross-account sharing without copying secrets	Three-place authorization is easy to get wrong
Compliance: provable rotation cadence	Stuck `AWSPENDING` needs operational hygiene

When each side dominates: for production database credentials on a hot path, the advantages are decisive — alternating-user + RDS Proxy turns the biggest risk (the cutover window) into a non-event, and the cost is rounding error against the cost of a leaked standing password. For a rarely-used, low-value, easily-revoked credential, the disadvantages can outweigh the benefit — a rotation Lambda you must maintain for a secret nobody attacks is overhead; a long manual-rotation interval may be enough. For third-party credentials with hard rate limits or fragile APIs, weigh the custom-rotator engineering against simply rotating manually on a calendar.

Hands-on lab

This lab stands up single-user rotation on an RDS PostgreSQL secret end to end, triggers a manual cycle, verifies the staging labels moved, and tears everything down. It assumes an existing RDS PostgreSQL instance reachable from where you run the rotation function; for a fully free-tier path, use a db.t3.micro (free tier eligible for 12 months) in a public subnet so you can skip the VPC endpoint for the lab (never do this in production).

The lab at a glance:

Step	Action	Command	Expected result
1	Create the secret with engine metadata	`create-secret`	Secret ARN returned
2	Deploy the managed rotation function	SAR / console	Lambda ARN returned
3	Grant SM invoke on the Lambda	`add-permission`	Statement added
4	Enable rotation	`rotate-secret --rotation-lambda-arn`	`VersionId` for pending
5	Trigger a manual cycle	`rotate-secret`	All four steps run
6	Verify labels moved	`describe-secret`	New `AWSCURRENT`, no `AWSPENDING`
7	Confirm it authenticates	`get-secret-value` + `psql`	Login succeeds
8	Tear down	`delete-secret`, delete Lambda	Resources gone

Step 1 — create the secret with the engine metadata the managed function needs:

aws secretsmanager create-secret \
  --name lab/rotation/pg-app \
  --secret-string '{
    "engine":"postgres",
    "host":"lab-db.abc123.us-east-1.rds.amazonaws.com",
    "username":"app_user",
    "password":"InitialPassw0rd!",
    "dbname":"appdb",
    "port":5432
  }'

Step 2 — deploy the managed Postgres single-user rotation function from the Serverless Application Repository (the console “Edit rotation” wizard does this for you; the CLI path uses aws serverlessrepo create-cloud-formation-change-set). For the lab, the console wizard is fastest: open the secret → Rotation → Edit rotation → Create a new Lambda function → choose the single-user template.

Step 3 — grant Secrets Manager permission to invoke the function (the console does this; shown here for completeness):

aws lambda add-permission \
  --function-name SecretsManagerRotationLabFn \
  --statement-id AllowSM \
  --action lambda:InvokeFunction \
  --principal secretsmanager.amazonaws.com

Step 4 — enable rotation pointing at the function, on a 30-day schedule:

# Note: this call also starts a first rotation right away unless you add
# --no-rotate-immediately.
aws secretsmanager rotate-secret \
  --secret-id lab/rotation/pg-app \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRotationLabFn \
  --rotation-rules '{"ScheduleExpression":"rate(30 days)"}'

Step 5 — trigger a manual cycle now (runs all four steps immediately):

aws secretsmanager rotate-secret --secret-id lab/rotation/pg-app

Step 6 — verify the labels moved. This is the moment of truth:

aws secretsmanager describe-secret --secret-id lab/rotation/pg-app \
  --query '{Rotated:LastRotatedDate, Versions:VersionIdsToStages}'
# Expect: LastRotatedDate just now; one version AWSCURRENT, one AWSPREVIOUS, NO AWSPENDING.

Step 7 — confirm the live credential authenticates:

CRED=$(aws secretsmanager get-secret-value --secret-id lab/rotation/pg-app \
  --query SecretString --output text)
PGPASSWORD=$(echo "$CRED" | jq -r .password) \
  psql -h $(echo "$CRED" | jq -r .host) -U app_user -d appdb -c '\conninfo'
# "You are connected to database appdb" → rotation produced a working credential.

Step 8 — tear down so you stop paying for the secret and the instance:

aws secretsmanager delete-secret --secret-id lab/rotation/pg-app \
  --force-delete-without-recovery
aws lambda delete-function --function-name SecretsManagerRotationLabFn
# Delete the RDS instance from the console or:
aws rds delete-db-instance --db-instance-identifier lab-db \
  --skip-final-snapshot --delete-automated-backups

What each teardown action stops costing you:

Resource	Lingering cost if left	Teardown
Secret	~$0.40/month + API calls	`delete-secret --force-delete-without-recovery`
Rotation Lambda	Per-invocation (negligible idle)	`delete-function`
RDS instance	Hourly + storage	`delete-db-instance --skip-final-snapshot`
CMK (if created)	~$1/month	Schedule key deletion (7–30 day window)
VPC endpoint (if created)	Hourly per ENI	`delete-vpc-endpoints`

Common mistakes & troubleshooting

This is the section you keep open at 02:00. Each row is a real failure mode: the symptom, the root cause, the exact command or console path to confirm it, and the fix.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	Lambda times out, secret stuck `AWSPENDING`	No route to the DB or the SM API	CloudWatch Logs: function ends mid-step; `describe-secret` shows lingering `AWSPENDING`	Put rotator in DB VPC; add SM interface endpoint; open SG on DB port
2	Timeout before any step logs	VPC Lambda has no Secrets Manager endpoint	Logs show nothing; ENI exists but no 443 path	Add `com.amazonaws.<region>.secretsmanager` interface endpoint + private DNS (or NAT)
3	`setSecret` fails: permission denied	Alternating-user without `masterarn` / weak master	Logs: `permission denied to create/alter user`; secret JSON lacks `masterarn`	Add `masterarn`; master secret holds a superuser that can clone
4	Every step re-runs and desyncs	Steps are not idempotent	Logs show password regenerated each `createSecret` retry	Guard `createSecret` on existing `AWSPENDING`; guard `finishSecret` on already-current
5	Consumer gets `AccessDenied` on `GetSecretValue`	Missing/incorrect secret resource policy	`get-resource-policy`; `simulate-principal-policy`	Add resource policy granting the consumer account
6	Consumer gets `AccessDenied` on `kms:Decrypt`	Secret on `aws/secretsmanager` (uneditable)	`describe-secret` shows `KmsKeyId: alias/aws/secretsmanager`	Re-encrypt with a CMK; grant consumer `kms:Decrypt` via `kms:ViaService`
7	Resource policy allows but consumer still denied	Consumer identity policy missing the grant	`simulate-principal-policy` from the consumer role	Add `GetSecretValue` + `kms:Decrypt` to the consumer principal
8	App auth fails right after a clean rotation	App pinned to a `VersionId`	App config reads a fixed version, not `AWSCURRENT`	Read `AWSCURRENT` (default); never pin a version
9	5xx spike every rotation cycle	Single-user on a hot path	App logs show auth failures clustered at rotation time	Switch to alternating-user; front with RDS Proxy
10	Secret believed rotating but never does	Rotation disabled or silently failing	`describe-secret`: `RotationEnabled:false` or stale `LastRotatedDate`	Enable rotation; add Config rule + `RotationFailed` alarm
11	`rotate-secret` returns but nothing changes	SM can’t invoke the Lambda	Lambda resource policy lacks `secretsmanager.amazonaws.com`	Add `lambda:InvokeFunction` permission for the SM service principal
12	`finishSecret` runs forever / loops	`current_version` never updated, retry logic wrong	Logs: `UpdateSecretVersionStage` repeats	Return early when `current_version == token`; move label once
13	Rotation works in dev, times out in prod	Dev DB public; prod DB private, no endpoint	Same Lambda, only network differs	Add VPC config + SM endpoint + SG ingress in prod
14	KMS `AccessDenied` for the rotator itself	Role missing `Decrypt`/`GenerateDataKey` on the CMK	CloudTrail KMS event `AccessDenied`	Grant the execution role both KMS actions on the CMK

The decision table for the most common ambiguous signal — “rotation isn’t working”:

If you see…	It’s probably…	Do this
Lingering `AWSPENDING`, logs stop at `setSecret`/`testSecret`	No DB route or DB-side permission	Check SG ingress + DB grants
Timeout with zero step logs	No SM API egress (missing endpoint/NAT)	Add SM interface endpoint
`RotationEnabled:false`	Rotation never enabled	`rotate-secret --rotation-lambda-arn …`
`LastRotatedDate` old, `RotationEnabled:true`	Silent recurring failure	Check `RotationFailed` events / logs
Consumer denied, owner can read fine	Cross-account grant incomplete	Verify all three: resource + KMS + identity
App fails only at rotation time	Cache/version-pin problem	Read `AWSCURRENT`; refresh on auth fail

Best practices

Use a managed rotation function for any supported database; write custom code only for targets AWS does not cover. The managed functions are battle-tested and idempotent.
Choose alternating-user for any hot path, and reserve single-user for low-traffic, fault-tolerant credentials where a brief window is acceptable.
Always front shared database credentials with RDS Proxy so cutover happens once, at the proxy, instead of independently in every connection pool.
Put the rotation Lambda in the database VPC with a Secrets Manager interface endpoint and security-group access to the DB port — and verify both paths before enabling the schedule.
Encrypt with a customer-managed CMK, never aws/secretsmanager, for anything that might go cross-account, needs key-level audit, or needs an editable key policy.
Make every step idempotent: guard createSecret on an existing AWSPENDING, and finishSecret on an already-current version. Secrets Manager will retry.
Never delete the old credential before finishSecret in custom rotators — the old one must stay valid until cutover completes.
Consumers read AWSCURRENT only, through the caching client, refreshing on auth failure and never pinning a VersionId.
Monitor for silence, not just failure: alarm on RotationFailed, and separately alarm on absence of RotationSucceeded / stale LastRotatedDate via AWS Config.
Validate one step at a time in a non-prod environment before enabling the schedule; rotate-secret runs all four immediately.
Tag secrets for scoped IAM (e.g. rotation=managed) so the rotator role can be least-privilege via a ResourceTag condition.
Keep the rotation window (Duration) generous for slow or rate-limited targets so a single retry doesn’t blow the window.
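
The idempotency guard on createSecret from the list above can be sketched as follows. It assumes service_client is a boto3 Secrets Manager client and new_value is a hypothetical callable that produces the new secret as a dict (for example by calling get_random_password, or a provider's create-key API); the boolean return value is only there to make the behaviour visible.

```python
import json


def create_secret(service_client, arn, token, new_value):
    """Stage new_value() as AWSPENDING unless this token already has a version."""
    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # A retry: keep what the first attempt stored, so setSecret and the
        # stored secret can never disagree.
        return False
    except service_client.exceptions.ResourceNotFoundException:
        pass

    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(new_value()),
        VersionStages=["AWSPENDING"],
    )
    return True
```

The generator is only invoked on the path that stores a version, which is what prevents a retry from regenerating the password.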

Security notes

Rotation is a security control, so its own posture must be tight. The least-privilege and isolation rules that matter:

Control	What to do	Why
Rotator role scope	Limit to specific secret ARNs / tags, not `*`	A compromised rotator shouldn’t read every secret
KMS key separation	Per-domain CMKs, not one key for everything	Blast-radius isolation; per-key audit
Cross-account least privilege	Grant `GetSecretValue` to roles, not `:root`, where possible	Narrow the principal that can read
`kms:ViaService` condition	Restrict KMS use to Secrets Manager	Key can’t be used out-of-band by the grantee
Network isolation	Private subnets + interface endpoints	DB and SM traffic never traverses the internet
No secrets in logs	Never log secret values in the rotator	CloudWatch Logs is not a secret store
CloudTrail on KMS + SM	Log every `Decrypt` and `GetSecretValue`	Detect anomalous reads
Resource policy review	Audit who each secret is shared with	Prevent accidental over-sharing
Master secret protection	Tightest controls; superuser power	It can clone/alter any DB user
Deletion recovery window	Keep a recovery window in prod (7–30 days; 30 is the default)	Guard against accidental/malicious deletion

The encryption and identity story end to end: the secret is envelope-encrypted under a CMK; the rotator decrypts with kms:Decrypt scoped to that key; consumers decrypt only through secretsmanager.<region>.amazonaws.com via the kms:ViaService condition; and every read is a CloudTrail event you can alarm on. For the deeper KMS model see AWS KMS Encryption Deep Dive and Multi-Region KMS Keys & Envelope Encryption; for the IAM evaluation order behind the three-place authorization, IAM Fundamentals: Users, Roles, Policies, Evaluation and IAM Least Privilege & Permission Boundaries.

Cost & sizing

Secrets Manager pricing is simple but adds up at scale; rotation introduces Lambda and (usually) CMK and endpoint costs on top. The drivers:

Cost driver	Rate (us-east-1, approx)	Scales with	Notes
Secret storage	~$0.40 / secret / month	Number of secrets	The base line item
API calls	~$0.05 / 10,000 calls	Read volume	The caching client slashes this
Rotation Lambda	Per-invocation + duration	Rotations × 4 steps	Tiny; 4 short invocations per cycle
Customer-managed CMK	~$1 / key / month + API	Number of CMKs	Use shared CMKs per domain to bound it
SM interface endpoint	~$0.01/hr per ENI + per-GB	AZs × endpoints	One per AZ in the Lambda subnets
NAT (if used instead)	Hourly + per-GB egress	Egress volume	Often costlier than an endpoint

Right-sizing guidance and rough monthly figures (INR at ~₹84/USD, indicative):

Scenario	Secrets	CMKs	Endpoints	Rough USD/mo	Rough INR/mo
Single app, one DB secret	1	0 (managed key)	0 (public dev)	~$0.40	~₹35
Small prod, private DB	3	1	1 (1 AZ)	~$8–10	~₹700–850
Multi-account platform	30	3	6 (2 AZ ×3)	~$60–80	~₹5,000–6,700
High-read fleet (no cache)	10	1	2	~$20 + heavy API	~₹1,700 + API
Same fleet, caching client	10	1	2	~$20 (API negligible)	~₹1,700
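
The fixed part of each row can be reproduced from the approximate rates in the cost-driver table. A rough estimator, assuming 730 hours per month and ignoring Lambda invocations, KMS API requests, and per-GB endpoint data charges; the rates are the indicative us-east-1 figures quoted above, not a price list:

```python
HOURS_PER_MONTH = 730

# Approximate rates from the cost-driver table (USD).
RATES = {
    "secret_month": 0.40,
    "api_per_10k": 0.05,
    "cmk_month": 1.00,
    "endpoint_eni_hour": 0.01,
}


def monthly_cost_usd(secrets, cmks, endpoint_enis, api_calls=0, rates=RATES):
    """Estimate the monthly bill for secrets, keys, endpoints, and API calls."""
    storage = secrets * rates["secret_month"]
    api = api_calls / 10_000 * rates["api_per_10k"]
    keys = cmks * rates["cmk_month"]
    endpoints = endpoint_enis * rates["endpoint_eni_hour"] * HOURS_PER_MONTH
    return round(storage + api + keys + endpoints, 2)


print("small prod:", monthly_cost_usd(secrets=3, cmks=1, endpoint_enis=1))
print("multi-account:", monthly_cost_usd(secrets=30, cmks=3, endpoint_enis=6))
print("100M uncached reads:", monthly_cost_usd(10, 1, 2, api_calls=100_000_000))
```

The endpoint term dominates the small scenarios, and the API term dominates only when reads are uncached.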

The free-tier and cost-control levers worth knowing:

Lever	Effect	Caveat
30-day free trial	First 30 days free	Per account, starting when the first secret is stored; one-time
Caching client / Lambda extension	Cuts API calls ~90%+	Implement refresh-on-auth-fail
Shared CMK per domain	Fewer $1/month keys	Coarser blast radius
Interface endpoint over NAT	No per-GB egress on SM traffic	Pay per-ENI-hour instead
RDS Proxy holds the secret	App makes zero SM calls	Proxy has its own hourly cost
Consolidate related values	One secret, fewer line items	Don’t bundle unrelated trust domains

The cost of not rotating — a leaked standing credential leading to data exfiltration — dwarfs all of the above. Treat rotation cost as insurance premium, not overhead.

Interview & exam questions

Q1. Walk me through the four steps of a rotation Lambda. createSecret generates the new value and stores it as AWSPENDING; setSecret applies it to the backing service; testSecret authenticates with the pending value; finishSecret moves the AWSCURRENT label onto the pending version. Each step must be idempotent because Secrets Manager retries.

Q2. Why is alternating-user rotation safer than single-user for a hot path? Single-user changes the password on one user, creating a window where the old cached credential is invalid until the app refreshes. Alternating-user keeps two users and each cycle changes only the idle one, so the credential applications already hold (now AWSPREVIOUS) keeps working until they pick up the new AWSCURRENT. There is no invalid window.

Q3. A rotation Lambda for a private RDS instance times out with no useful error. What’s the most likely cause? It’s in the VPC and can reach the database but has no path to the Secrets Manager API — it’s missing a Secrets Manager interface VPC endpoint (or a NAT route). The classic tell is a timeout before any step logs.

Q4. Why can’t you use the aws/secretsmanager key for a cross-account secret? Its key policy is AWS-managed and not editable, so you cannot grant another account kms:Decrypt. Cross-account secrets must use a customer-managed CMK whose key policy you can edit to permit the consumer account.

Q5. A consumer account has a resource policy granting it the secret but still gets AccessDenied. Why? Authorization is evaluated independently in each account. The consumer’s own identity-based policy must also allow secretsmanager:GetSecretValue (and kms:Decrypt). Resource policy and identity policy must both allow.

Q6. What is masterarn and when is it required? It points to an elevated master/superuser secret. Alternating-user rotation needs it because cloning a user and granting its privileges requires rights the application user does not have; the Lambda authenticates with the master secret to provision the clone.

Q7. Why must rotation steps be idempotent, and where does it bite hardest? Secrets Manager retries on failure. If createSecret regenerates a password on every retry, the stored secret desyncs from what setSecret applied. Guard createSecret on an existing AWSPENDING and finishSecret on an already-current version.

Q8. How do you keep applications from failing during the rotation window? Read AWSCURRENT via the caching client (never pin a VersionId), refresh the cache on auth failure, prefer alternating-user rotation, and front shared DB credentials with RDS Proxy so cutover happens once at the proxy.

Q9. How do you detect a secret that silently stopped rotating? Alarm on the absence of RotationSucceeded, not just on RotationFailed. Use AWS Config rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation, and check LastRotatedDate against the schedule.

Q10. In a custom rotator where the provider issues the key, when do you delete the old credential? Only in finishSecret, after AWSCURRENT has moved. Deleting the old key in createSecret or setSecret kills it before promotion and causes an outage.

Q11. What lingering AWSPENDING tells you, and how to recover. A step failed mid-cycle — usually setSecret/testSecret couldn’t reach the DB, or a permission was missing. Confirm with describe-secret’s VersionIdsToStages, fix the underlying cause (network/permission), then re-trigger rotate-secret.

Q12. Which IAM permissions does the rotation execution role need beyond the obvious secret actions? secretsmanager:GetRandomPassword (on *), kms:Decrypt and kms:GenerateDataKey on the CMK, plus the VPC ENI permissions (via the AWS managed Lambda-VPC policy) when it runs in a VPC.

These map to AWS Certified Security – Specialty (data protection, key management, incident response) and AWS Certified Solutions Architect – Professional (secure multi-account design). The cross-account KMS and resource-policy material is a recurring Security-Specialty theme.

Quick check

Which staging label do applications get by default from GetSecretValue, and which one only exists during a rotation?
You enable rotation on a private RDS secret and the Lambda times out before any step logs. What’s missing?
A spoke account has a secret resource policy granting it access but still gets AccessDenied on GetSecretValue. Name the two things that must also be true.
Why does alternating-user rotation eliminate the rotation-window auth errors that single-user can cause?
In a custom rotator for a provider that issues the key, at which step do you delete the old credential, and why not earlier?

Answers

Applications get AWSCURRENT by default; AWSPENDING exists only during a rotation cycle (and a lingering one means a step failed).
A Secrets Manager interface VPC endpoint (or a NAT route) in the Lambda’s subnets — the VPC Lambda can reach the DB but has no path to the SM API.
The encrypting key must be a customer-managed KMS key (CMK) whose key policy grants the consumer kms:Decrypt (not aws/secretsmanager), and the consumer’s own identity-based policy must allow GetSecretValue + kms:Decrypt.
Because the new password is set on the second, currently inactive user while the user behind the old AWSCURRENT keeps its password unchanged, so a cached copy of the old credential stays valid while the app catches up and there is never a moment where the credential in use is invalid.
In finishSecret, after AWSCURRENT has moved — deleting it earlier kills the credential before promotion and causes an outage.

Glossary

Staging label — a movable pointer (AWSCURRENT, AWSPENDING, AWSPREVIOUS, or custom) that points to exactly one version of a secret at a time.
AWSCURRENT — the staging label for the live version returned by default from GetSecretValue; what consumers should always read.
AWSPENDING — the transient label on the new version being created and tested during a rotation cycle; absent between rotations.
AWSPREVIOUS — the label automatically applied to the version that was AWSCURRENT before the last rotation, enabling one-generation rollback.
Rotation Lambda — the function Secrets Manager invokes four times per cycle to create, set, test, and finish a new credential version.
Step — the field in the rotation event (createSecret/setSecret/testSecret/finishSecret) that your function dispatches on.
Single-user rotation — strategy that changes the password on one database user each cycle; simple but has a brief invalid-credential window.
Alternating-user rotation — strategy that swaps between two database users each cycle, eliminating the invalid-credential window; needs a master secret.
masterarn — a secret field pointing to the elevated master/superuser secret used to clone the user in alternating-user rotation.
Interface VPC endpoint — an ENI-backed private path (com.amazonaws.<region>.secretsmanager) letting a VPC-bound Lambda reach the Secrets Manager API without internet egress.
Customer-managed key (CMK) — a KMS key whose key policy you control; the only key type shareable cross-account for secret encryption.
aws/secretsmanager — the AWS-managed default encryption key; free but with an uneditable policy, so it cannot be shared cross-account.
Resource policy — a policy attached to the secret itself, granting other accounts/principals access (the cross-account half of the grant).
kms:ViaService — a KMS condition key restricting use of the key to requests coming through a named service (e.g. Secrets Manager).
RDS Proxy — a managed connection pool that holds the secret and integrates with Secrets Manager, so rotation cutover happens once at the proxy.
Caching client — the AWS SDK helper (or Lambda extension) that caches AWSCURRENT in memory and refreshes on a TTL, cutting API calls and following rotation.

Next steps

Master the storage and naming fundamentals these mechanics build on in the AWS Secrets Manager & Parameter Store Deep Dive.
Go deep on the encryption layer — key policies, grants, envelope encryption — in the AWS KMS Encryption Deep Dive and Multi-Region KMS Keys & Envelope Encryption.
Make rotation cutover a non-event with RDS Proxy: Connection Pooling, Failover, IAM Auth, backed by the AWS RDS & Aurora Deep Dive.
Nail the cross-account trust model in IAM Cross-Account Roles, External ID, Confused Deputy and the evaluation order in IAM Least Privilege & Permission Boundaries.
Wire the alerting that catches silent rotation failure using the CloudWatch & CloudTrail Observability Deep Dive.