Storing a secret in AWS Secrets Manager is the easy part. The thing that actually reduces blast radius is rotation — and rotation is where teams quietly fail, because a botched cycle either locks an application out of its own database or, worse, succeeds silently while applications keep using a credential that no longer works. The mechanics are precise: one staging-label state machine you must respect, a versioning model you cannot shortcut, and networking/KMS permissions that are unforgiving when wrong. Get one of those three wrong and the failure mode is the same flavour every time — a Lambda that times out with no useful error, a secret stuck in AWSPENDING, and a 02:00 pager.
This guide walks the four-step rotation model end to end, contrasts single-user and alternating-user RDS strategies, shows how to write a custom rotator for non-RDS credentials, and covers the cross-account sharing platform teams always end up needing. It is built as a reference you keep open mid-incident: read the prose once, then keep the tables — the step contract, the IAM grants, the VPC/KMS requirements, the failure playbook — open at 02:00. Every operation gets both the aws CLI and the IaC (Terraform) form, because the console wires permissions you will forget to declare in code.
By the end you will stop guessing. When rotation fails you will know whether it is a non-idempotent step, a missing Secrets Manager VPC endpoint, a master secret that cannot clone the user, an aws/secretsmanager-encrypted secret that can never go cross-account, or a consumer whose identity policy was never granted even though the resource policy was. Knowing which in ninety seconds is what separates a five-minute incident from a two-hour one.
What problem this solves
A long-lived database password is a standing liability. It leaks through a log line, a .env committed by accident, a laptop image, an offboarded contractor who still remembers it. The only durable mitigation is to make the credential short-lived — rotate it on a schedule so that any copy an attacker holds expires on its own. Secrets Manager automates that, but the automation has sharp edges, and the edges are exactly where production breaks.
What breaks without getting this right: an engineer enables single-user rotation on a hot-path credential and every cycle produces a spike of authentication errors as forty services slowly notice the password changed; a rotation Lambda is dropped into the database VPC but never given a Secrets Manager endpoint, so it times out reaching the API and the secret sticks half-rotated in AWSPENDING; a central platform team shares a secret cross-account using the AWS-managed key and the consumer account gets AccessDenied on kms:Decrypt forever, because that key’s policy cannot be edited; a finishSecret step is non-idempotent, Secrets Manager retries it, and the staging labels desync. Each of these is perfectly diagnosable and each costs a team an afternoon the first time.
Who hits this: anyone running RDS/Aurora, DocumentDB, Redshift, RDS Proxy, or any credentialed third-party service at scale. It bites hardest on multi-account organisations (cross-account KMS is non-obvious), VPC-isolated databases (the networking is the silent killer), high-traffic hot paths (single-user rotation windows show up as real error budget), and custom integrations where AWS publishes no managed rotation function and you must write all four steps yourself.
To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the one place to look first:
| Failure class | What you observe | First question to ask | First place to look | Most common single cause |
|---|---|---|---|---|
| Rotation Lambda timeout | RotationFailed, stuck AWSPENDING | Can the Lambda reach BOTH the DB and the SM API? | CloudWatch Logs for the rotator | No Secrets Manager VPC endpoint in private subnets |
| Rotation-window auth errors | 5xx spike every cycle | Single-user or alternating-user? | VersionIdsToStages + app logs | Single-user on a hot path |
| setSecret permission denied | Logs stop at step 2 | Can the function clone / ALTER USER? | Rotator logs, DB grants | Missing masterarn (alternating-user) |
| Cross-account AccessDenied | Consumer can’t GetSecretValue | Resource policy AND KMS AND identity policy all allow? | simulate-principal-policy | aws/secretsmanager key (uneditable) cross-account |
| Stale / never-rotated secret | Believed covered, never rotated | Is rotation even enabled and succeeding? | AWS Config rules, LastRotatedDate | Rotation disabled or silently failing |
| App uses dead credential | Auth fails after a clean rotation | Is the app pinned to a VersionId? | App config / cache code | Pinned version instead of AWSCURRENT |
Learning objectives
By the end of this article you can:
- Explain the staging-label state machine (AWSCURRENT/AWSPENDING/AWSPREVIOUS) and why rotation is a relabel, never an in-place edit.
- Implement the four-step rotation Lambda (createSecret→setSecret→testSecret→finishSecret) with each step correctly idempotent against Secrets Manager’s retries.
- Choose between single-user and alternating-user RDS rotation and explain exactly why alternating-user eliminates the invalid-credential window.
- Stand up the VPC, security-group, and Secrets Manager endpoint wiring a rotation Lambda needs to reach a private database without timing out.
- Author the IAM execution role, Lambda resource policy, and KMS grants rotation requires, and avoid the aws/secretsmanager-key cross-account trap.
- Write a custom rotator for a third-party credential where the provider issues the secret instead of accepting one, collapsing createSecret/setSecret safely.
- Share a secret cross-account so both the resource policy and the consumer’s identity policy allow, and rotation keeps flowing to consumers automatically.
- Integrate consumers with the caching client so they follow AWSCURRENT without rotation-window failures, and monitor rotation health and staleness with EventBridge and AWS Config.
Prerequisites & where this fits
You should already understand the Secrets Manager basics — that a secret is a named, versioned, KMS-encrypted blob you read with GetSecretValue, billed per secret per month plus per 10,000 API calls. You should be comfortable with IAM identity-based vs resource-based policies, KMS key policies, and the difference between an AWS-managed key (aws/secretsmanager) and a customer-managed key (CMK). Basic VPC literacy (private subnets, security groups, interface endpoints) and the ability to run aws CLI v2 and read JSON output are assumed.
This sits at the intersection of the Security and Databases tracks. It builds directly on the AWS Secrets Manager & Parameter Store Deep Dive (storage, naming, versioning fundamentals) and the AWS KMS Encryption Deep Dive (key policies, grants, envelope encryption — the cross-account half of this article is a KMS problem). On the database side it pairs with the AWS RDS & Aurora Deep Dive and especially RDS Proxy: Connection Pooling, Failover, IAM Auth, because RDS Proxy is what turns rotation cutover into a non-event. The cross-account sharing pattern leans on IAM Cross-Account Roles, External ID, Confused Deputy and Multi-Region KMS Keys & Envelope Encryption. The rotation engine itself is a Lambda running inside the database VPC.
A quick map of who owns what during a rotation incident, so you call the right person fast:
| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Secret + staging labels | The versioned value, AWSCURRENT/AWSPENDING | App / platform | Stuck pending, desynced labels |
| Rotation Lambda | The four-step state machine | App / platform | Non-idempotent step, timeout |
| VPC / SG / endpoints | Network path to DB and SM API | Network team | Lambda timeout (no route) |
| IAM role + resource policy | Who may rotate / invoke | Security / platform | Permission denied mid-cycle |
| KMS key + key policy | Encryption + cross-account decrypt | Security / KMS admin | Cross-account AccessDenied |
| RDS / Aurora | The backing user(s), master secret | DBA / platform | ALTER USER fails, no clone |
| Consumer app | Cache, refresh, version pinning | Consuming team | Uses dead credential after rotation |
Core concepts
Five mental models make every later diagnosis obvious.
Rotation is a relabel, never an in-place edit. A secret holds one or more versions, each carrying one or more staging labels. Rotation creates a new version, parks it under AWSPENDING, proves it works, then atomically moves AWSCURRENT onto it. Applications reading AWSCURRENT (the default) never see the pending value until cutover completes — that is the entire reason rotation can be zero-downtime. If you ever find yourself “updating” a secret value during rotation, you are off the rails.
Secrets Manager drives your Lambda four times. It invokes your function once per step (createSecret, setSecret, testSecret, finishSecret), passing the step name and the new version’s ID (ClientRequestToken). Your function is a dispatcher on that field. Secrets Manager retries on failure, so every step must be idempotent — running twice must be safe.
RDS rotation has two strategies, and the choice is operational, not cosmetic. Single-user changes the password on the same user every cycle and has a brief window where the old credential is invalid. Alternating-user keeps two users and swaps between them, so the promoted credential always belonged to a user that already existed with a known-good password — no invalid window. The alternating strategy needs a second, elevated master secret to clone the user.
The rotation Lambda must reach two endpoints, and a VPC Lambda loses easy egress. It must reach the database (security-group access on the DB port) and the Secrets Manager API. A Lambda attached to a VPC has no default internet egress, so it needs either a Secrets Manager interface VPC endpoint or a NAT path to call the API. Forgetting this is the single most common rotation failure, and it presents as a bare timeout.
Cross-account sharing requires three independent allows. The owning account’s secret resource policy must grant the consumer; the KMS key must permit the consumer to Decrypt (and it must be a CMK — the managed key cannot be shared); and the consumer’s own identity-based policy must allow GetSecretValue and kms:Decrypt. All three must be true or it silently fails, because each account authorises independently.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters to rotation |
|---|---|---|---|
| Version | An immutable copy of the secret value | The secret | Rotation writes a new one each cycle |
| Staging label | A movable pointer to one version | On a version | AWSCURRENT/AWSPENDING drive cutover |
| AWSCURRENT | The version apps get by default | One version | What consumers should always read |
| AWSPENDING | The in-flight version being tested | One version (transient) | Lingering = a step failed mid-cycle |
| AWSPREVIOUS | The version that was current | One version | One-generation rollback |
| Rotation Lambda | The four-step state machine | Lambda (often in VPC) | The engine; must be idempotent |
| Step | Which of the four phases is running | Invocation event | Your dispatcher switches on it |
| masterarn | Pointer to the elevated master secret | Secret JSON | Required for alternating-user |
| Single-user | Rotate the same DB user | Rotation strategy | Brief invalid-credential window |
| Alternating-user | Swap between two DB users | Rotation strategy | No invalid window (production-grade) |
| SM VPC endpoint | Interface endpoint to the SM API | Private subnets | Lets a VPC Lambda call the API |
| Resource policy | Who else may read this secret | On the secret | Cross-account / cross-principal grant |
| CMK | Customer-managed KMS key | KMS | The only key shareable cross-account |
1. The versioning and staging-label model
Everything in rotation hinges on staging labels. A secret holds one or more versions, each carrying labels. Three are reserved and load-bearing:
- AWSCURRENT: the version applications get by default when they call GetSecretValue without specifying a version.
- AWSPENDING: the in-flight version being created and tested during rotation. It does not exist between rotations.
- AWSPREVIOUS: automatically applied to the version that was AWSCURRENT once rotation finishes, so you can roll back one generation.
A staging label points to exactly one version at a time. Rotation is fundamentally the act of creating a new version labelled AWSPENDING, proving it works, then atomically moving AWSCURRENT onto it. Applications that always read AWSCURRENT (the default) never see the pending value until cutover completes. That is the whole reason rotation can be zero-downtime: the new credential is fully provisioned and tested before anything points production at it.
Mental model: rotation never edits a secret in place. It writes a new version, parks it under AWSPENDING, and only relabels AWSCURRENT at the very end. If you ever find yourself “updating” a secret value during rotation, you are off the rails.
The three reserved labels, what moves them, and what each means for callers:
| Staging label | Set by | Points to | Exists between rotations? | What a caller gets |
|---|---|---|---|---|
| AWSCURRENT | finishSecret (relabel) | The live, in-use version | Yes (always exactly one) | Default value from GetSecretValue |
| AWSPENDING | createSecret (PutSecretValue) | The new version under test | No (transient) | Only if explicitly requested by stage |
| AWSPREVIOUS | Auto, when AWSCURRENT moves | The prior current version | Yes after first rotation | One generation back, for rollback |
| Custom label | You, via UpdateSecretVersionStage | Any version you choose | Yes if you set it | Whatever you pin it to (avoid for apps) |
The version-stage invariants you must never violate — break one and rotation desyncs:
| Invariant | Why it holds | What violating it causes |
|---|---|---|
| Exactly one version holds AWSCURRENT | Callers need one unambiguous live value | Apps read an indeterminate credential |
| AWSPENDING is removed once promoted | It only labels the in-flight version | Lingering pending blocks the next cycle |
| A version is immutable once written | A version ID (the ClientRequestToken) is bound to a single value | “Editing” forks state from what was applied |
| AWSPREVIOUS is one generation only | Rollback target, not history | Treating it as an archive loses older values |
| Labels move atomically | Cutover must be all-or-nothing | Partial moves expose a half-rotated secret |
Inspect the live label-to-version mapping at any time — this is your first diagnostic command in any rotation incident:
aws secretsmanager describe-secret --secret-id prod/payments/db-app \
--query 'VersionIdsToStages'
# Healthy: one version → ["AWSCURRENT"], one → ["AWSPREVIOUS"], NO "AWSPENDING".
# A lingering "AWSPENDING" means a step failed mid-cycle.
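The same check can be scripted so it runs in a health probe or a runbook. The following is a minimal sketch, assuming you have already fetched the VersionIdsToStages mapping (for example from describe-secret); the function name and the three return strings are illustrative, not part of any AWS API. It treats AWSPENDING sitting on the same version as AWSCURRENT as healthy, since that can be left over after a successful cycle.

```python
def classify_rotation_state(version_stages):
    """Classify a VersionIdsToStages mapping (version ID -> list of labels).

    Returns "healthy", "pending" (a rotation is in flight or stuck),
    or "invalid" (the AWSCURRENT invariant is broken).
    """
    current = [v for v, stages in version_stages.items() if "AWSCURRENT" in stages]
    pending = [v for v, stages in version_stages.items() if "AWSPENDING" in stages]

    # Exactly one version must hold AWSCURRENT.
    if len(current) != 1:
        return "invalid"

    # AWSPENDING on a different version than AWSCURRENT means a cycle
    # has started and not finished.
    if any(v != current[0] for v in pending):
        return "pending"

    return "healthy"
```

A "pending" result is only a problem if it persists after the rotation window; pair it with the rotator's CloudWatch logs to find which step stopped.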
2. The four-step rotation Lambda
Secrets Manager drives rotation by invoking your Lambda four times, passing a Step field in the event each time. Your function is a dispatcher on that field. The contract is fixed:
{
  "SecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/app/db-AbCdEf",
  "ClientRequestToken": "uuid-of-the-new-version",
  "Step": "createSecret"
}
The ClientRequestToken is the version ID of the new (AWSPENDING) version. What each step must do, and the idempotency rule that keeps retries safe:
| Step | Responsibility | Idempotency rule | Failure if you skip it |
|---|---|---|---|
| createSecret | Generate the new credential and store it as AWSPENDING. | If AWSPENDING already exists, do nothing. | Each retry regenerates → desync vs setSecret |
| setSecret | Apply the pending credential to the backing service (e.g. ALTER USER). | Must tolerate being run twice. | Double-apply or partial apply on retry |
| testSecret | Connect/authenticate using the pending credential. | Read-only validation. | A broken credential gets promoted |
| finishSecret | Move AWSCURRENT to the pending version. | Return early if already promoted. | Cutover repeats or strands the label |
The four steps as a state machine — what is true before and after each, and who acts:
| Phase | Before | Action | After | Actor |
|---|---|---|---|---|
| Start | One version: AWSCURRENT | SM allocates a new version ID | Token reserved for AWSPENDING | Secrets Manager |
| createSecret | No AWSPENDING | Generate value, PutSecretValue | New version labelled AWSPENDING | Your Lambda |
| setSecret | Pending exists, not yet live in DB | Apply to backing service | DB accepts the pending credential | Your Lambda |
| testSecret | Pending applied | Authenticate with pending | Pending proven good | Your Lambda |
| finishSecret | Pending good | Move AWSCURRENT → pending | Pending becomes current; old → previous | Your Lambda |
The most common bug is non-idempotent steps. Secrets Manager retries; if createSecret blindly generates a new password every invocation, you desync the stored secret from what setSecret applied. Always check for an existing AWSPENDING first.
import json


def create_secret(service_client, arn, token):
    # Source of truth is AWSCURRENT; we derive the new value from it.
    current = service_client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")
    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # Pending already staged on a retry: do not regenerate.
        return
    except service_client.exceptions.ResourceNotFoundException:
        pass
    # Copy every field from the current secret and replace only the password.
    secret = json.loads(current["SecretString"])
    secret["password"] = service_client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]
    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )
The finishSecret step is equally precise — it is a relabel, not a write. Find the version currently holding AWSCURRENT (returning early if it is already token, i.e. a retry), then move the label:
def finish_secret(service_client, arn, token, current_version):
    if current_version == token:
        return  # Already promoted (retry).
    service_client.update_secret_version_stage(
        SecretId=arn,
        VersionStage="AWSCURRENT",
        MoveToVersionId=token,
        RemoveFromVersionId=current_version,
    )
For most database engines you do not write this yourself — AWS publishes managed rotation functions as serverless application templates. The managed functions you should reach for before writing custom code:
| Backing service | Managed function family | Strategies offered | Write custom instead? |
|---|---|---|---|
| RDS/Aurora PostgreSQL | SecretsManagerRDSPostgreSQLRotation… | Single-user, MultiUser | No, use managed |
| RDS/Aurora MySQL/MariaDB | SecretsManagerRDSMySQLRotation… | Single-user, MultiUser | No, use managed |
| RDS SQL Server | SecretsManagerRDSSQLServerRotation… | Single-user, MultiUser | No, use managed |
| RDS Oracle | SecretsManagerRDSOracleRotation… | Single-user, MultiUser | No, use managed |
| Amazon Redshift | SecretsManagerRedshiftRotation… | Single-user, MultiUser | No, use managed |
| DocumentDB | SecretsManagerMongoDBRotation… | Single-user, MultiUser | No, use managed |
| Generic / third-party API | SecretsManagerRotationTemplate | You implement | Yes, start from the template |
| OS / SSH / non-API target | (none) | n/a | Yes, full custom |
The dispatcher skeleton that ties the four steps together — note the explicit guard that rotation is even enabled and that the version is pending:
import boto3


def lambda_handler(event, context):
    arn, token, step = event["SecretId"], event["ClientRequestToken"], event["Step"]
    client = boto3.client("secretsmanager")
    md = client.describe_secret(SecretId=arn)
    if not md.get("RotationEnabled"):
        raise ValueError(f"Rotation not enabled for {arn}")
    versions = md["VersionIdsToStages"]
    if token not in versions:
        raise ValueError(f"Version {token} has no stage for {arn}")
    if "AWSCURRENT" in versions[token]:
        return  # Already current: nothing to do.
    if "AWSPENDING" not in versions[token]:
        raise ValueError(f"Version {token} is not staged as AWSPENDING for {arn}")
    current_version = next(v for v, s in versions.items() if "AWSCURRENT" in s)
    steps = {
        "createSecret": lambda: create_secret(client, arn, token),
        "setSecret": lambda: set_secret(client, arn, token),
        "testSecret": lambda: test_secret(client, arn, token),
        "finishSecret": lambda: finish_secret(client, arn, token, current_version),
    }
    if step not in steps:
        raise ValueError(f"Unknown rotation step {step!r}")
    steps[step]()
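One way to convince yourself that the steps survive retries is to run each one twice against an in-memory stand-in before deploying. The sketch below is not boto3: FakeSecretsManager and its method names are hypothetical and deliberately simplified, and the two step functions are trimmed-down versions of the ones above, using the standard-library secrets module in place of GetRandomPassword. It models a single rotation cycle only.

```python
import json
import secrets


class FakeSecretsManager:
    """Hypothetical in-memory stand-in for one secret and its versions."""

    class ResourceNotFound(Exception):
        pass

    def __init__(self, initial):
        self.values = {"v1": json.dumps(initial)}  # version ID -> SecretString
        self.stages = {"v1": ["AWSCURRENT"]}       # version ID -> staging labels

    def get_secret_value(self, version_id=None, stage="AWSCURRENT"):
        for vid, labels in self.stages.items():
            if stage in labels and version_id in (None, vid):
                return self.values[vid]
        raise self.ResourceNotFound(stage)

    def put_secret_value(self, token, secret_string, stage):
        self.values[token] = secret_string
        self.stages[token] = [stage]

    def move_current(self, to_version, from_version):
        # Mirrors the relabel: old current becomes previous, pending becomes current.
        self.stages[from_version].remove("AWSCURRENT")
        self.stages[from_version].append("AWSPREVIOUS")
        self.stages[to_version] = ["AWSCURRENT"]


def create_secret(sm, token):
    try:
        sm.get_secret_value(version_id=token, stage="AWSPENDING")
        return  # Retry: pending already staged, do not regenerate.
    except sm.ResourceNotFound:
        pass
    secret = json.loads(sm.get_secret_value())
    secret["password"] = secrets.token_urlsafe(24)
    sm.put_secret_value(token, json.dumps(secret), "AWSPENDING")


def finish_secret(sm, token):
    current = next(v for v, s in sm.stages.items() if "AWSCURRENT" in s)
    if current == token:
        return  # Retry: already promoted.
    sm.move_current(token, current)
```

Calling create_secret twice with the same token should leave the stored pending value unchanged, and calling finish_secret twice should leave exactly one AWSCURRENT. If either property fails in your real handler, fix it before enabling a schedule.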
3. RDS rotation: single-user vs alternating-user
For RDS, RDS Proxy, DocumentDB, and Redshift, you choose a rotation strategy, and the choice has real operational consequences.
Single-user rotation changes the password on the same database user every cycle. Simple, but hazardous: between setSecret (which runs ALTER USER ... PASSWORD) and the moment your application picks up the new AWSCURRENT, any connection attempt with the old password fails. If your app caches the secret and only refreshes on auth failure, you get a brief window of errors every rotation. Acceptable for low-traffic or fault-tolerant apps; not for a hot path.
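The "cache and refresh on auth failure" consumer behaviour described above can be sketched in a few lines. This is an illustrative pattern, not the AWS caching client: RefreshingSecret, AuthError, and connect_with_refresh are hypothetical names, fetch stands for whatever call returns the AWSCURRENT value, and connect stands for your driver's connect function.

```python
import time


class AuthError(Exception):
    """Placeholder for the driver's authentication failure."""


class RefreshingSecret:
    """Cache a secret value; refetch on TTL expiry or when forced."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch      # callable returning the AWSCURRENT value
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self, force=False):
        expired = (
            self._fetched_at is None
            or self._clock() - self._fetched_at >= self._ttl
        )
        if force or expired:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        return self._value


def connect_with_refresh(secret, connect):
    """Try the cached credential; on AuthError refetch once and retry."""
    try:
        return connect(secret.get())
    except AuthError:
        return connect(secret.get(force=True))
```

With single-user rotation, the first attempt after a password change fails and the retry succeeds, which is exactly the brief error window the text describes. A shorter TTL narrows the window at the cost of more API calls.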
Alternating-user rotation (the “multi-user” strategy) is the production-grade choice. It clones the existing user into a second account and alternates between the two each cycle. Cycle N uses app_user; cycle N+1 rotates app_user_clone, swaps AWSCURRENT to it, and leaves the previous user fully valid. Because the newly promoted credential belongs to a user that already existed with a known-good password, there is no window where the current credential is invalid.
| Strategy | Users | Rotation-window failures | Setup requirement | Best for |
|---|---|---|---|---|
| Single-user | 1 | Possible (brief) | App user can ALTER itself | Low-traffic, fault-tolerant apps |
| Alternating-user | 2 | None if app respects AWSCURRENT | A superuser secret to clone the user | Hot paths, fleets, strict SLAs |
A fuller side-by-side, because the trade-offs go beyond the failure window:
| Dimension | Single-user | Alternating-user |
|---|---|---|
| DB users required | 1 | 2 (original + clone) |
| Master/superuser secret | Not required | Required (masterarn) |
| Invalid-credential window | Yes (until cache refresh) | None |
| Privileges the rotator needs | ALTER USER on self | CREATE/ALTER USER, clone grants |
| Behaviour on app cache lag | Auth errors until refresh | Old user still valid → no errors |
| Rollback safety | One generation | One generation + prior user live |
| Operational complexity | Lower | Higher (two users to reason about) |
| Recommended for hot paths | No | Yes |
Alternating-user requires a second secret — the master/superuser secret — referenced via masterarn, because cloning a user and granting its privileges needs elevated rights the application user does not have. The rotation Lambda authenticates with the master secret to provision the clone, then rotates the application secret.
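The alternation itself is a small piece of logic: given the username in AWSCURRENT, pick the other user of the pair. A minimal sketch, assuming the app_user / app_user_clone naming used in this section; the _clone suffix and the function name are assumptions for illustration, so check the source of the function you actually deploy.

```python
CLONE_SUFFIX = "_clone"  # assumed naming, matching app_user / app_user_clone


def alternate_username(current_username):
    """Return the other user of the pair for the next rotation cycle."""
    if current_username.endswith(CLONE_SUFFIX):
        # Currently on the clone: rotate back to the original user.
        return current_username[: -len(CLONE_SUFFIX)]
    # Currently on the original: rotate to the clone.
    return current_username + CLONE_SUFFIX
```

Applying the function twice returns the starting username, which is what makes the two users swap roles every cycle while the previously current user keeps a valid password.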
# Enable alternating-user rotation on an RDS secret using the AWS-managed
# Postgres rotation function, every 30 days.
aws secretsmanager rotate-secret \
--secret-id prod/payments/db-app \
--rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRDSPostgreSQLRotationMultiUser \
--rotation-rules '{"ScheduleExpression": "rate(30 days)", "Duration": "2h"}'
The application’s secret must declare which strategy it uses. A multi-user Postgres secret looks like this — every field below is load-bearing for the managed function:
{
  "engine": "postgres",
  "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
  "username": "app_user",
  "password": "current-password",
  "dbname": "payments",
  "port": 5432,
  "masterarn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-master-XyZ"
}
The required and optional keys in an RDS secret JSON, and what the managed function does with each:
| Key | Required? | Used by | Notes / gotcha |
|---|---|---|---|
| engine | Yes | Function (driver select) | postgres, mysql, oracle, sqlserver, etc. |
| host | Yes | Connect | Cluster endpoint; use the writer for DDL |
| port | Yes | Connect | 5432 PG, 3306 MySQL, 1521 Oracle, 1433 MSSQL |
| username | Yes | ALTER/auth | The rotating app user |
| password | Yes | Auth | The function overwrites this each cycle |
| dbname | Often | Connect | Some engines require it to authenticate |
| masterarn | Alternating only | Clone the user | Required by the multi-user function; omit for single-user |
| dbInstanceIdentifier | Optional | Resolve host | Alternative to host for some templates |
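A pre-flight check on the secret JSON catches a missing key before the rotation function does. The following sketch assumes the required keys are exactly the ones marked "Yes" in the table above and uses the presence of masterarn only to infer which strategy the secret is shaped for; the function name is hypothetical.

```python
import json

# Keys the table above marks as required for an RDS secret.
REQUIRED_KEYS = ("engine", "host", "port", "username", "password")


def intended_strategy(secret_string):
    """Validate an RDS secret's JSON and report the strategy it is shaped for.

    Raises ValueError if a required key is missing. Returns
    "alternating-user" when masterarn is present, else "single-user".
    """
    secret = json.loads(secret_string)
    missing = [key for key in REQUIRED_KEYS if key not in secret]
    if missing:
        raise ValueError(f"secret is missing required keys: {missing}")
    return "alternating-user" if secret.get("masterarn") else "single-user"
```

Run it against the SecretString you are about to store; a result of "single-user" on a secret you meant to rotate with the multi-user function means masterarn was left out.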
The schedule controls in --rotation-rules, and how each behaves:
| Field | Meaning | Example | Notes |
|---|---|---|---|
| ScheduleExpression | When rotation fires | rate(30 days) / cron(0 3 1 * ? *) | Shortest supported interval is four hours, e.g. rate(4 hours) |
| Duration | Rotation window length | "2h" | SM starts within this window, not exactly on the dot |
| AutomaticallyAfterDays | Legacy interval (days) | 30 | Superseded by ScheduleExpression |
| (manual trigger) | rotate-secret with no rules | n/a | Runs all four steps immediately, once |
Put RDS Proxy in front of this. The proxy keeps a warm connection pool and integrates natively with Secrets Manager, so when AWSCURRENT flips it picks up the new credential without each application instance re-reading the secret. Combined with alternating-user rotation, cutover becomes a non-event — covered in depth in RDS Proxy: Connection Pooling, Failover, IAM Auth.
4. VPC, networking, and KMS permissions
This is where rotation breaks in practice, and the failure is always the same flavour: the Lambda times out with no useful error. The function has to reach two endpoints — the database and the Secrets Manager API — and both paths must be open.
If your database is in private subnets, the rotation Lambda must attach to the same VPC with security-group access to the DB port. But a VPC Lambda loses default internet egress and still needs to call Secrets Manager. The two ways to give it API access, side by side:
| Egress option | How it works | Cost | When to choose | Gotcha |
|---|---|---|---|---|
| Secrets Manager interface endpoint | com.amazonaws.<region>.secretsmanager ENI in the Lambda subnets | ~hourly/ENI + per-GB | Private workloads (the right answer) | Needs private_dns_enabled and an SG allowing 443 |
| NAT gateway | Route 0.0.0.0/0 → NAT → IGW | Hourly + per-GB egress | You already run NAT for other reasons | Pays per-GB processing; API traffic leaves the VPC for the public Secrets Manager endpoint |
| (none — Lambda not in VPC) | Public Lambda reaches SM directly | None extra | DB is publicly reachable (rare/bad) | Can’t reach a private DB at all |
The two endpoints the rotation Lambda must reach, and what blocks each:
| Target | Reached over | Requires | If blocked you see |
|---|---|---|---|
| RDS/Aurora (DB port) | VPC, in-subnet | SG ingress on 5432/3306 from rotator SG | Timeout at setSecret/testSecret |
| Secrets Manager API | Interface endpoint or NAT | Endpoint + SG 443, or NAT route | Timeout before any step logs |
| KMS (decrypt the secret) | Called by Secrets Manager on the caller’s behalf | kms:Decrypt on the key for the rotator role | AccessDenied on GetSecretValue |
| (alternating) Master secret read | Secrets Manager API | Same as SM API + read on master | setSecret cannot clone the user |
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.sm_endpoint.id]
  private_dns_enabled = true
}

# The rotation Lambda's SG must be allowed inbound on the DB port.
resource "aws_security_group_rule" "db_from_rotator" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds.id
  source_security_group_id = aws_security_group.rotation_lambda.id
}
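Because a missing network path shows up only as a timeout, it helps to test both paths explicitly from inside the Lambda's subnets. The following sketch uses plain TCP connects from the standard library; the function names and the target labels are illustrative. A successful connect proves the route and security groups only, not IAM or KMS permissions.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


def preflight(targets, timeout=3.0):
    """Check each named (host, port) target and report reachability."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in targets.items()
    }
```

A typical call passes the database host from the secret on its port and the regional Secrets Manager hostname on 443, for example preflight({"database": (db_host, 5432), "secretsmanager": ("secretsmanager.us-east-1.amazonaws.com", 443)}). If the database is reachable and the API is not, the missing piece is the interface endpoint or NAT route.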
For KMS: if the secret is encrypted with a customer-managed key (CMK) rather than the AWS-managed aws/secretsmanager key — and at scale it should be, for cross-account and auditability reasons — the rotation Lambda’s execution role needs kms:Decrypt and kms:GenerateDataKey on that key, and the key policy must permit Secrets Manager. The role also needs the standard rotation permissions plus secretsmanager:GetRandomPassword.
The exact IAM actions the rotation execution role needs, and why each is there:
| Action | On resource | Why the rotator needs it |
|---|---|---|
| secretsmanager:DescribeSecret | The secret | Read rotation state and version stages |
| secretsmanager:GetSecretValue | The secret (+ master) | Read AWSCURRENT/AWSPENDING, master to clone |
| secretsmanager:PutSecretValue | The secret | Write the new AWSPENDING version |
| secretsmanager:UpdateSecretVersionStage | The secret | Move AWSCURRENT at finishSecret |
| secretsmanager:GetRandomPassword | * | Generate the new password |
| kms:Decrypt | The CMK | Decrypt the stored secret |
| kms:GenerateDataKey | The CMK | Encrypt the new version |
| ec2:CreateNetworkInterface etc. | * | VPC Lambda ENI management (managed policy) |
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:DescribeSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
        "secretsmanager:UpdateSecretVersionStage"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/rotation": "managed" }
      }
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetRandomPassword",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/<cmk-id>"
    }
  ]
}
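A quick lint over the policy document catches an action dropped during a refactor. This is a rough sketch, assuming the required set is the Secrets Manager and KMS actions from the table above; it only collects literal action names from Allow statements and does not evaluate wildcards, resources, conditions, or Deny statements, so treat a clean result as necessary rather than sufficient.

```python
REQUIRED_ACTIONS = {
    "secretsmanager:DescribeSecret",
    "secretsmanager:GetSecretValue",
    "secretsmanager:PutSecretValue",
    "secretsmanager:UpdateSecretVersionStage",
    "secretsmanager:GetRandomPassword",
    "kms:Decrypt",
    "kms:GenerateDataKey",
}


def missing_rotation_actions(policy):
    """Return the required actions not literally allowed by the policy dict."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    granted = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        granted.update(actions)

    return sorted(REQUIRED_ACTIONS - granted)
```

Run against the policy above, the function returns an empty list; remove the GetRandomPassword statement and it names the gap.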
Secrets Manager invokes your function on your behalf, so the Lambda’s resource policy must grant lambda:InvokeFunction to secretsmanager.amazonaws.com. The AWS console wires this automatically; in IaC you add it explicitly:
resource "aws_lambda_permission" "allow_secretsmanager" {
  statement_id  = "AllowSecretsManagerInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.rotator.function_name
  principal     = "secretsmanager.amazonaws.com"
}
The four policies rotation touches. Confusing them is a common time-sink, so keep them straight:
| Policy | Attached to | Direction | Grants |
|---|---|---|---|
| Execution role (identity) | The Lambda | What the Lambda may do | SM read/write, KMS, ENI |
| Lambda resource policy | The Lambda | Who may invoke it | secretsmanager.amazonaws.com invoke |
| KMS key policy | The CMK | Who may use the key | SM service + rotator + (cross-acct) consumer |
| Secret resource policy | The secret | Who else may read it | Consumer account (cross-account) |
5. A custom rotator for non-RDS credentials
For third-party API keys, a MongoDB Atlas user, or an internal service token, there is no managed function — you implement the four steps against the provider’s API. The shape is identical; only setSecret and testSecret change. The key discipline: many providers do not let you set a key value — they generate and return it. There, createSecret and setSecret collapse — you call the provider’s “create credential” API once, store the result as AWSPENDING, and make setSecret a near no-op. Clean up the old credential only in finishSecret, never before — deleting the old key before the new one is promoted is how you cause an outage.
The two provider archetypes and how the four steps adapt to each:
| Step | Provider accepts a value (DB-like) | Provider issues the value (API-key-like) |
|---|---|---|
| createSecret | Generate value, store AWSPENDING | Call provider “create key”, store returned value as AWSPENDING |
| setSecret | Apply value to the service (ALTER USER) | Near no-op (key already live at provider) |
| testSecret | Authenticate with pending value | Make an authenticated call with pending key |
| finishSecret | Move AWSCURRENT | Move AWSCURRENT, then delete the old provider key |
import json

# provider_key_exists, provider_create_key and provider_authenticated_call are
# placeholders for your provider's SDK calls.

def set_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Idempotent: only create at the provider if this pending key is not live yet.
    if not provider_key_exists(pending["api_key_id"]):
        provider_create_key(pending["api_key_id"], pending["api_key_secret"])


def test_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Make a real authenticated call; a non-200 response must fail the step.
    resp = provider_authenticated_call(pending["api_key_secret"])
    if resp.status_code != 200:
        raise ValueError("Pending credential failed validation")
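The final step for the provider-issues-the-key archetype can be sketched the same way. This is a minimal illustration, not the managed template: `revoke_provider_key` is a hypothetical callable wrapping your provider's "delete key" API, and the secret JSON is assumed to carry an `api_key_id` field as in the snippets above. It shows the ordering that matters: read the outgoing credential, move AWSCURRENT, and only then revoke.

```python
import json


def finish_secret(service_client, arn, token, revoke_provider_key):
    """Promote the pending version, then revoke the credential it replaced.

    revoke_provider_key is a hypothetical callable for the provider's
    "delete key" API; it receives the old key's id.
    """
    versions = service_client.describe_secret(SecretId=arn)["VersionIdsToStages"]
    current_version = next(
        (vid for vid, stages in versions.items() if "AWSCURRENT" in stages), None
    )
    if current_version == token:
        return "already-current"  # a retried finishSecret lands here and does nothing

    # Read the outgoing credential BEFORE the label moves, so we know what to revoke.
    old = json.loads(
        service_client.get_secret_value(SecretId=arn, VersionId=current_version)[
            "SecretString"
        ]
    )

    service_client.update_secret_version_stage(
        SecretId=arn,
        VersionStage="AWSCURRENT",
        MoveToVersionId=token,
        RemoveFromVersionId=current_version,
    )

    # Only now is the old key safe to revoke: new reads resolve the new AWSCURRENT.
    revoke_provider_key(old["api_key_id"])
    return "promoted"
```

One gap this sketch leaves open: if the revoke call fails after the label has moved, a retry returns early and the old key survives, so a real rotator would likely want a separate sweep for orphaned keys.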
Custom-rotator pitfalls that cause real outages, and the discipline that avoids each:
| Pitfall | What happens | Discipline |
|---|---|---|
| Delete old key in setSecret | Old key dies before promotion → outage | Delete only in finishSecret, after cutover |
| Non-idempotent provider create | Retry creates duplicate keys | Check provider_key_exists before create |
| No testSecret call | Broken key gets promoted | Always make a real authenticated call |
| Assuming you can set the key value | Provider rejects/ignores it | Store what the provider returns |
| Hard provider rate limits | Create/delete throttled mid-cycle | Back off; keep the window (Duration) generous |
| Leaking the old key forever | Credential sprawl | finishSecret revokes the prior credential |
Deploy and schedule a custom rotator the same way, pointing --rotation-lambda-arn at your function. Validate one step at a time before enabling the schedule — rotate-secret runs all four immediately, and you do not want to discover a broken finishSecret in production.
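One way to validate a step at a time is to call the handler directly with the event Secrets Manager would send. The sketch below assumes your handler takes the standard rotation event (`Step`, `SecretId`, `ClientRequestToken`) and ignores the Lambda context; `run_steps` is an illustrative helper, not part of any AWS SDK. By default it stops after testSecret, so nothing is promoted.

```python
STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def run_steps(handler, arn, token, upto="testSecret"):
    """Invoke a rotation handler one step at a time, stopping after `upto`.

    Returns the steps that ran. An exception from the handler propagates,
    so the traceback points at the step that needs debugging.
    """
    if upto not in STEPS:
        raise ValueError(f"unknown step: {upto}")
    completed = []
    for step in STEPS:
        event = {"SecretId": arn, "ClientRequestToken": token, "Step": step}
        handler(event, None)  # assumption: the handler does not use the context
        completed.append(step)
        if step == upto:
            break
    return completed
```

Run this against a non-production secret. Handlers modeled on the AWS template may validate the token against the secret's existing versions, in which case the token needs to come from a real pending version rather than a made-up UUID.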
6. Cross-account sharing
A central account often owns secrets that spoke-account workloads consume: a shared database credential, a partner API key. Three things must all be true or the read fails with AccessDenied: a resource policy on the secret, KMS permissions on the encrypting key, and, on the consumer side, an identity policy on the principal. Three independent allows, evaluated in three different places.
The three grants cross-account sharing requires, where each lives, and the symptom when it is the one missing:
| Grant | Lives in | Account | Symptom if missing |
|---|---|---|---|
| Secret resource policy | On the secret | Owner | AccessDenied on GetSecretValue |
| KMS key policy Decrypt | On the CMK | Owner | AccessDenied on kms:Decrypt (can’t decrypt) |
| Identity policy | On the principal | Consumer | AccessDenied even though resource policy allows |
| CMK (not aws/sm) | KMS choice | Owner | Managed key can’t be shared at all |
The resource policy grants the consumer account access to the secret:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "*"
    }
  ]
}
The trap: you cannot use the AWS-managed aws/secretsmanager key for cross-account access — its key policy is not editable to permit another account. Encrypt cross-account secrets with a customer-managed CMK and grant the consumer account kms:Decrypt in the key policy:
{
  "Sid": "AllowConsumerDecrypt",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
    }
  }
}
Why aws/secretsmanager cannot go cross-account, contrasted with a CMK — this single distinction is the most common cross-account failure:
| Property | aws/secretsmanager (managed) | Customer-managed CMK |
|---|---|---|
| Key policy editable | No | Yes |
| Cross-account grant possible | No | Yes |
| Cost | Free | ~$1/key/month + API |
| Rotation of the key itself | AWS-managed | You choose (annual+) |
| Auditability granularity | Coarse | Per-key CloudTrail |
| Use for shared secrets | Never | Always |
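On the consumer side, the IAM principal still needs its own identity-based policy allowing secretsmanager:GetSecretValue and kms:Decrypt against those ARNs. Resource policy and identity policy must both allow, since each account authorizes independently. Check the consumer's identity-policy half with the policy simulator, then confirm the full chain with a real GetSecretValue call from the consumer role, because the simulator does not pull in the owner account's resource policy or key policy on its own: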
# From the CONSUMER account: does this role actually get the secret?
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::444455556666:role/app-role \
--action-names secretsmanager:GetSecretValue kms:Decrypt \
--resource-arns arn:aws:secretsmanager:us-east-1:111122223333:secret:shared/partner-api-XyZ \
--query 'EvaluationResults[].{action:EvalActionName, decision:EvalDecision}'
Rotation keeps running in the owning account; consumers always resolve AWSCURRENT and so follow rotation automatically. The deeper cross-account mechanics (external ID, confused-deputy, session policies) are in IAM Cross-Account Roles, External ID, Confused Deputy; the KMS half is in AWS KMS Encryption Deep Dive.
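Because the three allows live in three places, it can help to generate them side by side from the same inputs so they cannot drift apart. The following is a sketch under assumptions: `cross_account_policies` is an illustrative helper, the key ARN is something you supply (the text above does not name one), and the documents mirror the JSON shown earlier in this section.

```python
def cross_account_policies(consumer_principal_arn, secret_arn, key_arn, region):
    """Build the three documents cross-account sharing needs.

    Returns a dict with the secret resource policy (owner account), the
    key-policy statement (owner account), and the identity policy to attach
    to the consumer principal (consumer account).
    """
    secret_resource_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": consumer_principal_arn},
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": "*",
            }
        ],
    }
    key_policy_statement = {
        "Sid": "AllowConsumerDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": consumer_principal_arn},
        "Action": "kms:Decrypt",
        "Resource": "*",
        # Decrypt is only usable when the request comes through Secrets Manager.
        "Condition": {
            "StringEquals": {
                "kms:ViaService": f"secretsmanager.{region}.amazonaws.com"
            }
        },
    }
    consumer_identity_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": secret_arn,
            },
            {"Effect": "Allow", "Action": "kms:Decrypt", "Resource": key_arn},
        ],
    }
    return {
        "secret_resource_policy": secret_resource_policy,
        "key_policy_statement": key_policy_statement,
        "consumer_identity_policy": consumer_identity_policy,
    }
```

Passing a role ARN rather than the account `:root` as `consumer_principal_arn` narrows who can read, at the cost of updating the policies when the consuming role changes.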
7. Application integration without rotation-window failures
The cache makes or breaks the consumer experience. Calling GetSecretValue on every request is slow and hits API throttling. The right pattern is the AWS caching client (Java, Python, Go, .NET, or the Lambda extension), which caches AWSCURRENT in memory and refreshes on an interval.
import json

import boto3
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

client = boto3.client("secretsmanager")
cache = SecretCache(config=SecretCacheConfig(secret_refresh_interval=3600), client=client)

def get_connection():
    secret = json.loads(cache.get_secret_string("prod/payments/db-app"))
    # connect() stands in for your DB driver's connect call.
    return connect(secret)  # On auth failure, force-refresh and retry once.
The integration approaches ranked from worst to best, with the trade-off each makes:
| Pattern | Latency | API cost | Rotation-safe? | Verdict |
|---|---|---|---|---|
| GetSecretValue per request | High | High (throttles) | Yes (always fresh) | Never do this |
| Read once at boot, never refresh | Low | Minimal | No — misses rotation | Breaks on every cycle |
| Caching client (TTL) | Low | Low | Yes (refreshes) | Recommended for services |
| Caching client + refresh-on-auth-fail | Low | Low | Yes + self-heals | Best for single-user |
| Lambda extension (localhost cache) | Lowest | Low | Yes | Best for serverless |
| RDS Proxy (proxy holds the secret) | Lowest | None (app) | Yes — cutover at proxy | Best for shared DB creds |
Two rules keep you safe across rotations:
- Refresh on auth failure, not only on a timer. If a DB connection is rejected, invalidate the cache entry, re-read AWSCURRENT, and retry once. With alternating-user rotation this almost never fires, but it is your safety net for single-user.
- Never pin a VersionId. Read AWSCURRENT (the default). Pinning a version means you never follow rotation, which is the cardinal sin.
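The first rule can be sketched without tying it to any particular cache library. The code below is a hand-rolled illustration, not the AWS caching client's API: `TTLSecretCache`, `AuthError`, and `connect_with_refresh` are hypothetical names, `fetch` is whatever reads AWSCURRENT in your code, and `connect` is your driver's connect call.

```python
import time


class AuthError(Exception):
    """Stand-in for whatever your DB driver raises on a rejected login."""


class TTLSecretCache:
    """Tiny in-memory cache: re-reads the secret after ttl_seconds or on invalidate()."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch  # callable returning the current secret as a dict
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires = 0.0

    def get(self):
        if self._value is None or self._clock() >= self._expires:
            self._value = self._fetch()
            self._expires = self._clock() + self._ttl
        return self._value

    def invalidate(self):
        self._value = None


def connect_with_refresh(cache, connect):
    """Try the cached credential; on auth failure re-read once and retry."""
    try:
        return connect(cache.get())
    except AuthError:
        cache.invalidate()
        return connect(cache.get())  # a second failure propagates to the caller
```

Retrying exactly once is deliberate: if the freshly read credential is also rejected, the problem is not a stale cache, and looping would only add load during an incident.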
The caching-client knobs and how to reason about each:
| Setting | What it does | Default | When to change |
|---|---|---|---|
| secret_refresh_interval | TTL before re-reading AWSCURRENT | 3600 s | Lower for faster pickup; higher to cut API calls |
| max_cache_size | How many secrets cached | 1024 | Raise if a process reads many secrets |
| exception_retry_delay_base | Backoff on API errors | seconds | Tune under throttling |
| Force-refresh on auth fail | Re-read immediately on rejection | your code | Always implement for single-user |
| Version stage requested | Which label to read | AWSCURRENT | Leave default; never pin a version |
The Lambda extension is cleanest for serverless consumers: it runs a local HTTP cache with built-in TTL, so your function makes a localhost call instead of hitting the Secrets Manager API.
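A minimal sketch of that localhost call, assuming the AWS Parameters and Secrets Lambda Extension is attached as a layer and left on its default port (2773). The function names are illustrative; the request shape (the `/secretsmanager/get` path, the `secretId` query parameter, and the `X-Aws-Parameters-Secrets-Token` header carrying the function's session token) follows the extension's documented interface.

```python
import json
import os
import urllib.parse
import urllib.request


def build_extension_request(secret_id, session_token, port=2773):
    """Build the localhost request the extension expects."""
    query = urllib.parse.urlencode({"secretId": secret_id})
    url = f"http://localhost:{port}/secretsmanager/get?{query}"
    return urllib.request.Request(
        url, headers={"X-Aws-Parameters-Secrets-Token": session_token}
    )


def get_secret_via_extension(secret_id):
    """Fetch and parse a JSON secret through the extension's local cache."""
    req = build_extension_request(secret_id, os.environ["AWS_SESSION_TOKEN"])
    with urllib.request.urlopen(req, timeout=2) as resp:
        payload = json.loads(resp.read())
    # The body mirrors a GetSecretValue response; the secret itself is a JSON string.
    return json.loads(payload["SecretString"])
```

Inside a function, `get_secret_via_extension("prod/payments/db-app")` would return the secret as a dict. The function's execution role still needs secretsmanager:GetSecretValue, since the extension calls the API on its behalf.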
8. Monitoring rotation health and staleness
Rotation that fails silently is worse than no rotation, because you believe you are covered. Wire three signals:
- RotationFailed: Secrets Manager emits this event on failure; route it through EventBridge to SNS.
- RotationSucceeded: track it to detect absence. A secret that has not rotated on schedule is the real risk, visible only by alarming on staleness.
- Last-rotated age: a scheduled check comparing LastRotatedDate against the schedule, alerting on overdue secrets.
The signals to wire, what each tells you, and how to surface it:
| Signal | Source | Tells you | Surface via |
|---|---|---|---|
| RotationFailed | SM event (CloudTrail) | A cycle broke right now | EventBridge → SNS/PagerDuty |
| RotationSucceeded | SM event (CloudTrail) | A cycle completed | Metric; alarm on absence |
| LastRotatedDate age | describe-secret | Secret is overdue | Scheduled Lambda / Config |
| Lingering AWSPENDING | VersionIdsToStages | A step failed mid-cycle | Scheduled check / dashboard |
| secretsmanager-rotation-enabled-check | AWS Config rule | Rotation not even enabled | Config dashboard / SNS |
| secretsmanager-secret-periodic-rotation | AWS Config rule | Not rotated within max age | Config dashboard / SNS |
{
  "source": ["aws.secretsmanager"],
  "detail-type": ["AWS Service Event via CloudTrail"],
  "detail": {
    "eventName": ["RotationFailed"]
  }
}
For staleness, AWS Config is the durable approach. Its managed rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation flag secrets without rotation enabled or not rotated within a maximum age — far better than a custom script you will forget to maintain. Pair this with the CloudTrail/CloudWatch foundations in CloudWatch & CloudTrail Observability Deep Dive.
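If you do run a scheduled check alongside Config, the logic is small. The sketch below classifies a `describe_secret` response; `rotation_health` is an illustrative helper, and it assumes the response shape boto3 returns (`RotationEnabled`, a timezone-aware `LastRotatedDate`, and `VersionIdsToStages`). It treats AWSPENDING as lingering only when it sits on a version that is not also AWSCURRENT, which is a deliberately conservative reading.

```python
from datetime import datetime, timedelta, timezone


def rotation_health(described, max_age_days, now=None):
    """Return a list of findings for one secret; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    findings = []

    if not described.get("RotationEnabled", False):
        findings.append("rotation-disabled")

    last = described.get("LastRotatedDate")
    if last is None:
        findings.append("never-rotated")
    elif now - last > timedelta(days=max_age_days):
        findings.append("overdue")

    # A pending label on a version that never became current suggests a
    # cycle stopped partway through.
    stages = described.get("VersionIdsToStages", {})
    if any(
        "AWSPENDING" in labels and "AWSCURRENT" not in labels
        for labels in stages.values()
    ):
        findings.append("lingering-pending")

    return findings
```

A scheduled job could call this for each secret and alert on any non-empty result. Note that a check which happens to run mid-rotation will see a pending version too, so alert on persistence across runs rather than on a single observation.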
Architecture at a glance
The diagram traces a single rotation across two accounts, left to right. On the far left, the consumer account (444455556666) runs the app — or, better, RDS Proxy — which only ever reads AWSCURRENT through the caching client, and a consumer IAM role that must carry its own GetSecretValue + kms:Decrypt grant. Next is the owner account (111122223333), which holds the secret (with its AWSCURRENT/AWSPENDING versions), the resource policy that grants the consumer, and the customer-managed CMK — explicitly not aws/secretsmanager, because the managed key cannot be shared. The third zone is the rotation tier living inside the database VPC: the four-step rotation Lambda and the Secrets Manager interface VPC endpoint that lets that VPC-bound Lambda reach the SM API at all. The right-most zone is the data tier in private subnets: the Aurora PostgreSQL cluster on 5432 with its two alternating users, and the elevated master secret whose superuser clones the application user.
Follow the flows: the consumer calls GetSecretValue and receives whatever AWSCURRENT points at; Secrets Manager invokes the rotation Lambda through its four steps; the Lambda reaches the database on 5432 to ALTER USER, then writes the new version back with PutSecretValue. The five numbered badges sit on exactly the hops that stall in practice — the Lambda losing its route to the DB (1), the missing SM endpoint (2), the master secret needed to clone (3), the un-shareable managed key (4), and the consumer’s missing identity grant (5) — and the legend narrates each as symptom, how to confirm, and the fix. Read the badges as the failure map laid directly over the architecture.
Real-world scenario
A payments platform team — call them NorthPay — ran 40+ microservices against an Aurora PostgreSQL cluster, all sharing one application credential rotated single-user every 7 days. Every rotation produced a 20-to-90 second spike of 500s as services hit auth failures and slowly refreshed their caches. The on-call playbook literally said “ignore the Thursday 02:00 error spike” — which is exactly how a real incident later got ignored for eleven minutes, because operators had been trained to wave off the Thursday alarm.
The constraints were hard: they could not coordinate a synchronized restart of 40 services, and the payments hot path’s error budget (a 99.95% SLO, ~21 minutes/month) was being burned by the rotation spike alone — four cycles a month at up to 90 seconds each was a meaningful fraction of the entire budget, spent on a self-inflicted event. The fix had two moves. First, they switched the secret to alternating-user rotation with a master secret, so the promoted credential always belonged to a user that already existed with a valid password — eliminating the invalid-credential window. Second, they put RDS Proxy in front of the cluster so cutover happened once, at the proxy, instead of independently in 40 connection pools.
aws rds create-db-proxy \
--db-proxy-name payments-proxy \
--engine-family POSTGRESQL \
--auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-app-XyZ","IAMAuth":"DISABLED"}]' \
--role-arn arn:aws:iam::111122223333:role/payments-proxy-role \
--vpc-subnet-ids subnet-aaa subnet-bbb
There was a second-order problem the team only found in staging: the rotation Lambda had originally been deployed outside the VPC (it had worked because the dev database was publicly reachable), and moving it into the production VPC immediately broke it with a bare timeout. The cause was the classic missing Secrets Manager interface endpoint — the VPC Lambda could now reach the database but had lost its path to the SM API. Adding com.amazonaws.us-east-1.secretsmanager with private DNS in the Lambda’s subnets fixed it in one change. They also moved the secret off aws/secretsmanager onto a CMK in the same motion, because a sibling analytics account needed read access and the managed key could never grant it.
The before/after, measured over a quarter:
| Metric | Before (single-user, no proxy) | After (alternating-user + proxy) |
|---|---|---|
| Rotation-window 5xx | 20–90 s spike every cycle | Zero across the quarter |
| Error budget spent on rotation | ~30% of monthly SLO | ~0% |
| Services needing a restart on rotation | Up to 40 (uncoordinated) | 0 (cutover at proxy) |
| Cross-account analytics read | Impossible (aws/sm key) | Working (CMK grant) |
| “Ignore the Thursday spike” runbook line | Present | Deleted |
After the change, rotation-window errors went to zero, and the “ignore the Thursday spike” line was deleted from the runbook — the real win, because a runbook that trains operators to ignore alerts is a latent incident.
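The roughly 30% figure in the table follows directly from the numbers in the scenario. As a worked check, assuming a 30-day month and the worst-case 90-second spike on each of four cycles (the function names are illustrative):

```python
def monthly_error_budget_minutes(slo, days=30):
    """Minutes of allowed unavailability per month for a given SLO."""
    return days * 24 * 60 * (1 - slo)


def rotation_budget_share(slo, cycles_per_month, worst_case_seconds, days=30):
    """Fraction of the monthly error budget consumed by rotation spikes."""
    spent_minutes = cycles_per_month * worst_case_seconds / 60
    return spent_minutes / monthly_error_budget_minutes(slo, days)


budget = monthly_error_budget_minutes(0.9995)   # about 21.6 minutes
share = rotation_budget_share(0.9995, 4, 90)    # 6 minutes of 21.6, about 28%
```

At the 20-second end of the range the share drops to roughly 6%, so the table's figure is the worst case rather than the average.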
Advantages and disadvantages
Automatic rotation is the right default for any credential that can rotate, but it is not free of cost or risk. The honest trade-off:
| Advantages | Disadvantages |
|---|---|
| Short-lived credentials shrink blast radius | Adds a moving part (the Lambda) that can fail |
| Zero-downtime cutover (alternating-user) | Networking/KMS wiring is unforgiving when wrong |
| Consumers follow AWSCURRENT automatically | Silent failure is possible without monitoring |
| Managed functions cover most databases | Custom targets need real engineering |
| Native audit via CloudTrail | Per-secret + per-API + CMK cost at scale |
| Cross-account sharing without copying secrets | Three-place authorization is easy to get wrong |
| Compliance: provable rotation cadence | Stuck AWSPENDING needs operational hygiene |
When each side dominates: for production database credentials on a hot path, the advantages are decisive — alternating-user + RDS Proxy turns the biggest risk (the cutover window) into a non-event, and the cost is rounding error against the cost of a leaked standing password. For a rarely-used, low-value, easily-revoked credential, the disadvantages can outweigh the benefit — a rotation Lambda you must maintain for a secret nobody attacks is overhead; a long manual-rotation interval may be enough. For third-party credentials with hard rate limits or fragile APIs, weigh the custom-rotator engineering against simply rotating manually on a calendar.
Hands-on lab
This lab stands up single-user rotation on an RDS PostgreSQL secret end to end, triggers a manual cycle, verifies the staging labels moved, and tears everything down. It assumes an existing RDS PostgreSQL instance reachable from where you run the rotation function; for a fully free-tier path, use a db.t3.micro (free tier eligible for 12 months) in a public subnet so you can skip the VPC endpoint for the lab (never do this in production).
The lab at a glance:
| Step | Action | Command | Expected result |
|---|---|---|---|
| 1 | Create the secret with engine metadata | create-secret | Secret ARN returned |
| 2 | Deploy the managed rotation function | SAR / console | Lambda ARN returned |
| 3 | Grant SM invoke on the Lambda | add-permission | Statement added |
| 4 | Enable rotation | rotate-secret --rotation-lambda-arn | VersionId for pending |
| 5 | Trigger a manual cycle | rotate-secret | All four steps run |
| 6 | Verify labels moved | describe-secret | New AWSCURRENT, no AWSPENDING |
| 7 | Confirm it authenticates | get-secret-value + psql | Login succeeds |
| 8 | Tear down | delete-secret, delete Lambda | Resources gone |
Step 1 — create the secret with the engine metadata the managed function needs:
aws secretsmanager create-secret \
--name lab/rotation/pg-app \
--secret-string '{
"engine":"postgres",
"host":"lab-db.abc123.us-east-1.rds.amazonaws.com",
"username":"app_user",
"password":"InitialPassw0rd!",
"dbname":"appdb",
"port":5432
}'
Step 2 — deploy the managed Postgres single-user rotation function from the Serverless Application Repository (the console “Edit rotation” wizard does this for you; the CLI path uses aws serverlessrepo create-cloud-formation-change-set). For the lab, the console wizard is fastest: open the secret → Rotation → Edit rotation → Create a new Lambda function → choose the single-user template.
Step 3 — grant Secrets Manager permission to invoke the function (the console does this; shown here for completeness):
aws lambda add-permission \
--function-name SecretsManagerRotationLabFn \
--statement-id AllowSM \
--action lambda:InvokeFunction \
--principal secretsmanager.amazonaws.com
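Step 4: enable rotation pointing at the function, on a 30-day schedule. By default this call also starts a first rotation immediately; pass --no-rotate-immediately if you want to defer it: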
aws secretsmanager rotate-secret \
--secret-id lab/rotation/pg-app \
--rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRotationLabFn \
--rotation-rules '{"ScheduleExpression":"rate(30 days)"}'
Step 5 — trigger a manual cycle now (runs all four steps immediately):
aws secretsmanager rotate-secret --secret-id lab/rotation/pg-app
Step 6 — verify the labels moved. This is the moment of truth:
aws secretsmanager describe-secret --secret-id lab/rotation/pg-app \
--query '{Rotated:LastRotatedDate, Versions:VersionIdsToStages}'
# Expect: LastRotatedDate just now; one version AWSCURRENT, one AWSPREVIOUS, NO AWSPENDING.
Step 7 — confirm the live credential authenticates:
CRED=$(aws secretsmanager get-secret-value --secret-id lab/rotation/pg-app \
--query SecretString --output text)
PGPASSWORD=$(echo "$CRED" | jq -r .password) \
psql -h $(echo "$CRED" | jq -r .host) -U app_user -d appdb -c '\conninfo'
# "You are connected to database appdb" → rotation produced a working credential.
Step 8 — tear down so you stop paying for the secret and the instance:
aws secretsmanager delete-secret --secret-id lab/rotation/pg-app \
--force-delete-without-recovery
aws lambda delete-function --function-name SecretsManagerRotationLabFn
# Delete the RDS instance from the console or:
aws rds delete-db-instance --db-instance-identifier lab-db \
--skip-final-snapshot --delete-automated-backups
What each teardown action stops costing you:
| Resource | Lingering cost if left | Teardown |
|---|---|---|
| Secret | ~$0.40/month + API calls | delete-secret --force-delete-without-recovery |
| Rotation Lambda | Per-invocation (negligible idle) | delete-function |
| RDS instance | Hourly + storage | delete-db-instance --skip-final-snapshot |
| CMK (if created) | ~$1/month | Schedule key deletion (7–30 day window) |
| VPC endpoint (if created) | Hourly per ENI | delete-vpc-endpoints |
Common mistakes & troubleshooting
This is the section you keep open at 02:00. Each row is a real failure mode: the symptom, the root cause, the exact command or console path to confirm it, and the fix.
| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Lambda times out, secret stuck AWSPENDING | No route to the DB or the SM API | CloudWatch Logs: function ends mid-step; describe-secret shows lingering AWSPENDING | Put rotator in DB VPC; add SM interface endpoint; open SG on DB port |
| 2 | Timeout before any step logs | VPC Lambda has no Secrets Manager endpoint | Logs show nothing; ENI exists but no 443 path | Add com.amazonaws.<region>.secretsmanager interface endpoint + private DNS (or NAT) |
| 3 | setSecret fails: permission denied | Alternating-user without masterarn / weak master | Logs: permission denied to create/alter user; secret JSON lacks masterarn | Add masterarn; master secret holds a superuser that can clone |
| 4 | Every step re-runs and desyncs | Steps are not idempotent | Logs show password regenerated each createSecret retry | Guard createSecret on existing AWSPENDING; guard finishSecret on already-current |
| 5 | Consumer gets AccessDenied on GetSecretValue | Missing/incorrect secret resource policy | get-resource-policy; simulate-principal-policy | Add resource policy granting the consumer account |
| 6 | Consumer gets AccessDenied on kms:Decrypt | Secret on aws/secretsmanager (uneditable) | describe-secret shows KmsKeyId: alias/aws/secretsmanager | Re-encrypt with a CMK; grant consumer kms:Decrypt via kms:ViaService |
| 7 | Resource policy allows but consumer still denied | Consumer identity policy missing the grant | simulate-principal-policy from the consumer role | Add GetSecretValue + kms:Decrypt to the consumer principal |
| 8 | App auth fails right after a clean rotation | App pinned to a VersionId | App config reads a fixed version, not AWSCURRENT | Read AWSCURRENT (default); never pin a version |
| 9 | 5xx spike every rotation cycle | Single-user on a hot path | App logs show auth failures clustered at rotation time | Switch to alternating-user; front with RDS Proxy |
| 10 | Secret believed rotating but never does | Rotation disabled or silently failing | describe-secret: RotationEnabled:false or stale LastRotatedDate | Enable rotation; add Config rule + RotationFailed alarm |
| 11 | rotate-secret returns but nothing changes | SM can’t invoke the Lambda | Lambda resource policy lacks secretsmanager.amazonaws.com | Add lambda:InvokeFunction permission for the SM service principal |
| 12 | finishSecret runs forever / loops | current_version never updated, retry logic wrong | Logs: UpdateSecretVersionStage repeats | Return early when current_version == token; move label once |
| 13 | Rotation works in dev, times out in prod | Dev DB public; prod DB private, no endpoint | Same Lambda, only network differs | Add VPC config + SM endpoint + SG ingress in prod |
| 14 | KMS AccessDenied for the rotator itself | Role missing Decrypt/GenerateDataKey on the CMK | CloudTrail KMS event AccessDenied | Grant the execution role both KMS actions on the CMK |
The decision table for the most common ambiguous signal — “rotation isn’t working”:
| If you see… | It’s probably… | Do this |
|---|---|---|
| Lingering AWSPENDING, logs stop at setSecret/testSecret | No DB route or DB-side permission | Check SG ingress + DB grants |
| Timeout with zero step logs | No SM API egress (missing endpoint/NAT) | Add SM interface endpoint |
| RotationEnabled:false | Rotation never enabled | rotate-secret --rotation-lambda-arn … |
| LastRotatedDate old, RotationEnabled:true | Silent recurring failure | Check RotationFailed events / logs |
| Consumer denied, owner can read fine | Cross-account grant incomplete | Verify all three: resource + KMS + identity |
| App fails only at rotation time | Cache/version-pin problem | Read AWSCURRENT; refresh on auth fail |
Best practices
- Use a managed rotation function for any supported database; write custom code only for targets AWS does not cover. The managed functions are battle-tested and idempotent.
- Choose alternating-user for any hot path, and reserve single-user for low-traffic, fault-tolerant credentials where a brief window is acceptable.
- Always front shared database credentials with RDS Proxy so cutover happens once, at the proxy, instead of independently in every connection pool.
- Put the rotation Lambda in the database VPC with a Secrets Manager interface endpoint and security-group access to the DB port — and verify both paths before enabling the schedule.
- Encrypt with a customer-managed CMK, never aws/secretsmanager, for anything that might go cross-account, needs key-level audit, or needs an editable key policy.
- Make every step idempotent: guard createSecret on an existing AWSPENDING, and finishSecret on an already-current version. Secrets Manager will retry.
- Never delete the old credential before finishSecret in custom rotators; the old one must stay valid until cutover completes.
- Consumers read AWSCURRENT only, through the caching client, refreshing on auth failure and never pinning a VersionId.
- Monitor for silence, not just failure: alarm on RotationFailed, and separately alarm on absence of RotationSucceeded / stale LastRotatedDate via AWS Config.
- Validate one step at a time in a non-prod environment before enabling the schedule; rotate-secret runs all four immediately.
- Tag secrets for scoped IAM (e.g. rotation=managed) so the rotator role can be least-privilege via a ResourceTag condition.
- Keep the rotation window (Duration) generous for slow or rate-limited targets so a single retry doesn’t blow the window.
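The idempotency guard on createSecret from the list above can be sketched as follows. This is an illustration in the spirit of the AWS rotation templates, not a copy of them: `generate_password` is a hypothetical callable you supply (a real rotator would typically wrap `get_random_password`), and the secret is assumed to be JSON with a `password` field.

```python
import json


def create_secret(service_client, arn, token, generate_password):
    """Stage a new AWSPENDING value unless this token already has one."""
    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # Retry path: keep the value that setSecret may already have applied.
        return "exists"
    except service_client.exceptions.ResourceNotFoundException:
        pass

    # Start from the current secret so host, username, etc. carry over.
    current = json.loads(
        service_client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")[
            "SecretString"
        ]
    )
    current["password"] = generate_password()
    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(current),
        VersionStages=["AWSPENDING"],
    )
    return "created"
```

The early return is the whole point: a retried createSecret must not mint a second password, or the stored value and the one applied to the database drift apart.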
Security notes
Rotation is a security control, so its own posture must be tight. The least-privilege and isolation rules that matter:
| Control | What to do | Why |
|---|---|---|
| Rotator role scope | Limit to specific secret ARNs / tags, not * | A compromised rotator shouldn’t read every secret |
| KMS key separation | Per-domain CMKs, not one key for everything | Blast-radius isolation; per-key audit |
| Cross-account least privilege | Grant GetSecretValue to roles, not :root, where possible | Narrow the principal that can read |
| kms:ViaService condition | Restrict KMS use to Secrets Manager | Key can’t be used out-of-band by the grantee |
| Network isolation | Private subnets + interface endpoints | DB and SM traffic never traverses the internet |
| No secrets in logs | Never log secret values in the rotator | CloudWatch Logs is not a secret store |
| CloudTrail on KMS + SM | Log every Decrypt and GetSecretValue | Detect anomalous reads |
| Resource policy review | Audit who each secret is shared with | Prevent accidental over-sharing |
| Master secret protection | Tightest controls; superuser power | It can clone/alter any DB user |
| Deletion recovery window | Keep the default 7–30 day window in prod | Guard against accidental/malicious deletion |
The encryption and identity story end to end: the secret is envelope-encrypted under a CMK; the rotator decrypts with kms:Decrypt scoped to that key; consumers decrypt only through secretsmanager.<region>.amazonaws.com via the kms:ViaService condition; and every read is a CloudTrail event you can alarm on. For the deeper KMS model see AWS KMS Encryption Deep Dive and Multi-Region KMS Keys & Envelope Encryption; for the IAM evaluation order behind the three-place authorization, IAM Fundamentals: Users, Roles, Policies, Evaluation and IAM Least Privilege & Permission Boundaries.
Cost & sizing
Secrets Manager pricing is simple but adds up at scale; rotation introduces Lambda and (usually) CMK and endpoint costs on top. The drivers:
| Cost driver | Rate (us-east-1, approx) | Scales with | Notes |
|---|---|---|---|
| Secret storage | ~$0.40 / secret / month | Number of secrets | The base line item |
| API calls | ~$0.05 / 10,000 calls | Read volume | The caching client slashes this |
| Rotation Lambda | Per-invocation + duration | Rotations × 4 steps | Tiny; 4 short invocations per cycle |
| Customer-managed CMK | ~$1 / key / month + API | Number of CMKs | Use shared CMKs per domain to bound it |
| SM interface endpoint | ~$0.01/hr per ENI + per-GB | AZs × endpoints | One per AZ in the Lambda subnets |
| NAT (if used instead) | Hourly + per-GB egress | Egress volume | Often costlier than an endpoint |
Right-sizing guidance and rough monthly figures (INR at ~₹84/USD, indicative):
| Scenario | Secrets | CMKs | Endpoints | Rough USD/mo | Rough INR/mo |
|---|---|---|---|---|---|
| Single app, one DB secret | 1 | 0 (managed key) | 0 (public dev) | ~$0.40 | ~₹35 |
| Small prod, private DB | 3 | 1 | 1 (1 AZ) | ~$8–10 | ~₹700–850 |
| Multi-account platform | 30 | 3 | 6 (2 AZ ×3) | ~$60–80 | ~₹5,000–6,700 |
| High-read fleet (no cache) | 10 | 1 | 2 | ~$20 + heavy API | ~₹1,700 + API |
| Same fleet, caching client | 10 | 1 | 2 | ~$20 (API negligible) | ~₹1,700 |
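The rough figures above can be reproduced from the approximate rates in the cost-driver table. The sketch below is only an estimate under those rates (about $0.40 per secret, $1 per CMK, $0.01 per endpoint ENI-hour, $0.05 per 10,000 API calls) and a 730-hour month; it ignores Lambda, per-GB endpoint charges, and KMS API calls. `monthly_cost_usd` is an illustrative name.

```python
HOURS_PER_MONTH = 730  # average month, assumed for the hourly endpoint rate


def monthly_cost_usd(secrets, cmks, endpoint_enis, api_calls=0):
    """Rough monthly cost from the approximate us-east-1 rates quoted above."""
    storage = secrets * 0.40
    keys = cmks * 1.00
    endpoints = endpoint_enis * 0.01 * HOURS_PER_MONTH
    api = api_calls / 10_000 * 0.05
    return round(storage + keys + endpoints + api, 2)


small_prod = monthly_cost_usd(secrets=3, cmks=1, endpoint_enis=1)
# 1.20 + 1.00 + 7.30, about $9.50, inside the table's $8-10 band

uncached_fleet = monthly_cost_usd(
    secrets=10, cmks=1, endpoint_enis=2, api_calls=50_000_000
)
# fixed costs of about $19.60 plus about $250 of API calls
```

The second example makes the caching argument concrete: at fifty million uncached reads a month (an invented volume, for illustration), API calls dwarf every fixed line item.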
The free-tier and cost-control levers worth knowing:
| Lever | Effect | Caveat |
|---|---|---|
| 30-day free trial | First 30 days free | Per account, starting when you store your first secret; one-time |
| Caching client / Lambda extension | Cuts API calls ~90%+ | Implement refresh-on-auth-fail |
| Shared CMK per domain | Fewer $1/month keys | Coarser blast radius |
| Interface endpoint over NAT | No per-GB egress on SM traffic | Pay per-ENI-hour instead |
| RDS Proxy holds the secret | App makes zero SM calls | Proxy has its own hourly cost |
| Consolidate related values | One secret, fewer line items | Don’t bundle unrelated trust domains |
The cost of not rotating — a leaked standing credential leading to data exfiltration — dwarfs all of the above. Treat rotation cost as insurance premium, not overhead.
Interview & exam questions
Q1. Walk me through the four steps of a rotation Lambda. createSecret generates the new value and stores it as AWSPENDING; setSecret applies it to the backing service; testSecret authenticates with the pending value; finishSecret moves the AWSCURRENT label onto the pending version. Each step must be idempotent because Secrets Manager retries.
Q2. Why is alternating-user rotation safer than single-user for a hot path? Single-user changes the password on one user, creating a window where the old cached credential is invalid until the app refreshes. Alternating-user swaps between two users, so the newly promoted credential belongs to a user that already existed with a known-good password — there is no invalid window.
Q3. A rotation Lambda for a private RDS instance times out with no useful error. What’s the most likely cause? It’s in the VPC and can reach the database but has no path to the Secrets Manager API — it’s missing a Secrets Manager interface VPC endpoint (or a NAT route). The classic tell is a timeout before any step logs.
Q4. Why can’t you use the aws/secretsmanager key for a cross-account secret? Its key policy is AWS-managed and not editable, so you cannot grant another account kms:Decrypt. Cross-account secrets must use a customer-managed CMK whose key policy you can edit to permit the consumer account.
Q5. A consumer account has a resource policy granting it the secret but still gets AccessDenied. Why? Authorization is evaluated independently in each account. The consumer’s own identity-based policy must also allow secretsmanager:GetSecretValue (and kms:Decrypt). Resource policy and identity policy must both allow.
Q6. What is masterarn and when is it required? It points to an elevated master/superuser secret. Alternating-user rotation needs it because cloning a user and granting its privileges requires rights the application user does not have; the Lambda authenticates with the master secret to provision the clone.
Q7. Why must rotation steps be idempotent, and where does it bite hardest? Secrets Manager retries on failure. If createSecret regenerates a password on every retry, the stored secret desyncs from what setSecret applied. Guard createSecret on an existing AWSPENDING and finishSecret on an already-current version.
Q8. How do you keep applications from failing during the rotation window? Read AWSCURRENT via the caching client (never pin a VersionId), refresh the cache on auth failure, prefer alternating-user rotation, and front shared DB credentials with RDS Proxy so cutover happens once at the proxy.
Q9. How do you detect a secret that silently stopped rotating? Alarm on the absence of RotationSucceeded, not just on RotationFailed. Use AWS Config rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation, and check LastRotatedDate against the schedule.
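The LastRotatedDate check from Q9 can be sketched as a pure function over a DescribeSecret response. The function name, the two-day grace period, and the 30-day fallback (used when the schedule is a ScheduleExpression rather than AutomaticallyAfterDays) are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def rotation_is_stale(description, now=None, grace_days=2, fallback_interval_days=30):
    """Return True when a secret has gone longer than its schedule without rotating.

    `description` is the dict returned by DescribeSecret; boto3 returns
    LastRotatedDate as a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    if not description.get("RotationEnabled", False):
        return True  # rotation was never enabled, or has been switched off
    last_rotated = description.get("LastRotatedDate")
    if last_rotated is None:
        return True  # enabled, but no cycle has ever completed
    rules = description.get("RotationRules") or {}
    interval_days = rules.get("AutomaticallyAfterDays", fallback_interval_days)
    return now - last_rotated > timedelta(days=interval_days + grace_days)
```

Run on a schedule across all secrets, a check like this catches the case where no RotationFailed event ever fires because rotation simply stopped being attempted.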
Q10. In a custom rotator where the provider issues the key, when do you delete the old credential? Only in finishSecret, after AWSCURRENT has moved. Deleting the old key in createSecret or setSecret kills it before promotion and causes an outage.
Q11. What does a lingering AWSPENDING tell you, and how do you recover? A step failed mid-cycle, usually because setSecret/testSecret couldn’t reach the DB or a permission was missing. Confirm with describe-secret’s VersionIdsToStages, fix the underlying cause (network/permission), then re-trigger rotate-secret.
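The VersionIdsToStages check from Q11 can be sketched as a small helper. It treats a version as stuck when it carries AWSPENDING but not AWSCURRENT, which, to my understanding, matches how the RotateSecret API documentation describes a rotation that is still in progress. The function name and the version IDs below are hypothetical.

```python
def stuck_pending_versions(version_ids_to_stages):
    """Return version IDs labelled AWSPENDING that are not also AWSCURRENT.

    `version_ids_to_stages` is the VersionIdsToStages map from DescribeSecret,
    e.g. {"v1": ["AWSCURRENT"], "v2": ["AWSPENDING"]}.
    """
    return sorted(
        version_id
        for version_id, stages in version_ids_to_stages.items()
        if "AWSPENDING" in stages and "AWSCURRENT" not in stages
    )


# A cycle that failed after createSecret: v2 is pending but was never promoted.
failed_cycle = {"v1": ["AWSCURRENT"], "v2": ["AWSPENDING"], "v0": ["AWSPREVIOUS"]}
print(stuck_pending_versions(failed_cycle))  # ['v2']
```

An empty result means either no rotation is in flight or the last one finished cleanly, including the case where AWSPENDING and AWSCURRENT sit on the same version.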
Q12. Which IAM permissions does the rotation execution role need beyond the obvious secret actions? secretsmanager:GetRandomPassword (on *), kms:Decrypt and kms:GenerateDataKey on the CMK, plus the VPC ENI permissions (via the AWS managed AWSLambdaVPCAccessExecutionRole policy) when it runs in a VPC.
These map to AWS Certified Security – Specialty (data protection, key management, incident response) and AWS Certified Solutions Architect – Professional (secure multi-account design). The cross-account KMS and resource-policy material is a recurring Security-Specialty theme.
Quick check
- Which staging label do applications get by default from GetSecretValue, and which one only exists during a rotation?
- You enable rotation on a private RDS secret and the Lambda times out before any step logs. What’s missing?
- A spoke account has a secret resource policy granting it access but still gets AccessDenied on GetSecretValue. Name the two things that must also be true.
- Why does alternating-user rotation eliminate the rotation-window auth errors that single-user can cause?
- In a custom rotator for a provider that issues the key, at which step do you delete the old credential, and why not earlier?
Answers
- Applications get AWSCURRENT by default; AWSPENDING exists only during a rotation cycle (and a lingering one means a step failed).
- A Secrets Manager interface VPC endpoint (or a NAT route) in the Lambda’s subnets. The VPC Lambda can reach the DB but has no path to the Secrets Manager API.
- The encrypting key must be a customer-managed CMK granting the consumer kms:Decrypt (not aws/secretsmanager), and the consumer’s own identity-based policy must allow GetSecretValue + kms:Decrypt.
- Because the promoted credential belongs to a second user that already existed with a known-good password, so there is never a moment where the current credential is invalid while the app catches up.
- In finishSecret, after AWSCURRENT has moved. Deleting it earlier kills the credential before promotion and causes an outage.
Glossary
- Staging label: a movable pointer (AWSCURRENT, AWSPENDING, AWSPREVIOUS, or custom) that points to exactly one version of a secret at a time.
- AWSCURRENT: the staging label for the live version returned by default from GetSecretValue; what consumers should always read.
- AWSPENDING: the transient label on the new version being created and tested during a rotation cycle; absent between rotations.
- AWSPREVIOUS: the label automatically applied to the version that was AWSCURRENT before the last rotation, enabling one-generation rollback.
- Rotation Lambda: the function Secrets Manager invokes four times per cycle to create, set, test, and finish a new credential version.
- Step: the field in the rotation event (createSecret/setSecret/testSecret/finishSecret) that your function dispatches on.
- Single-user rotation: strategy that changes the password on one database user each cycle; simple but has a brief invalid-credential window.
- Alternating-user rotation: strategy that swaps between two database users each cycle, eliminating the invalid-credential window; needs a master secret.
- masterarn: a secret field pointing to the elevated master/superuser secret used to clone the user in alternating-user rotation.
- Interface VPC endpoint: an ENI-backed private path (com.amazonaws.<region>.secretsmanager) letting a VPC-bound Lambda reach the Secrets Manager API without internet egress.
- Customer-managed key (CMK): a KMS key whose key policy you control; the only key type shareable cross-account for secret encryption.
- aws/secretsmanager: the AWS-managed default encryption key; free but with an uneditable policy, so it cannot be shared cross-account.
- Resource policy: a policy attached to the secret itself, granting other accounts/principals access (the cross-account half of the grant).
- kms:ViaService: a KMS condition key restricting use of the key to requests coming through a named service (e.g. Secrets Manager).
- RDS Proxy: a managed connection pool that holds the secret and integrates with Secrets Manager, so rotation cutover happens once at the proxy.
- Caching client: the AWS SDK helper (or Lambda extension) that caches AWSCURRENT in memory and refreshes on a TTL, cutting API calls and following rotation.
Next steps
- Master the storage and naming fundamentals these mechanics build on in the AWS Secrets Manager & Parameter Store Deep Dive.
- Go deep on the encryption layer — key policies, grants, envelope encryption — in the AWS KMS Encryption Deep Dive and Multi-Region KMS Keys & Envelope Encryption.
- Make rotation cutover a non-event with RDS Proxy: Connection Pooling, Failover, IAM Auth, backed by the AWS RDS & Aurora Deep Dive.
- Nail the cross-account trust model in IAM Cross-Account Roles, External ID, Confused Deputy and the evaluation order in IAM Least Privilege & Permission Boundaries.
- Wire the alerting that catches silent rotation failure using the CloudWatch & CloudTrail Observability Deep Dive.