Storing a secret in Secrets Manager is the easy part. The thing that reduces blast radius is rotation — and rotation is where teams quietly fail, because a botched cycle either locks an application out of its own database or, worse, succeeds silently while applications keep using a credential that no longer works. The mechanics are precise: one staging-label state machine you must respect, a versioning model you cannot shortcut, and networking/KMS permissions that are unforgiving when wrong. This guide walks the four-step model end to end, contrasts single-user and alternating-user RDS strategies, shows how to write a custom rotator for non-RDS credentials, and covers the cross-account sharing platform teams always end up needing.
1. The versioning and staging-label model
Everything in rotation hinges on staging labels. A secret holds one or more versions, each carrying labels. Three are reserved and load-bearing:
- AWSCURRENT: the version applications get by default when they call GetSecretValue without specifying a version.
- AWSPENDING: the in-flight version being created and tested during rotation. It does not exist between rotations.
- AWSPREVIOUS: automatically applied to the version that was AWSCURRENT once rotation finishes, so you can roll back one generation.
A staging label is attached to exactly one version at a time, although one version can carry several labels. Rotation is fundamentally the act of creating a new version labelled AWSPENDING, proving it works, then atomically moving AWSCURRENT onto it. Applications that always read AWSCURRENT (the default) never see the pending value until cutover completes. That is the whole reason rotation can be zero-downtime: the new credential is fully provisioned and tested before anything points production at it.
Mental model: rotation never edits a secret in place. It writes a new version, parks it under AWSPENDING, and only relabels AWSCURRENT at the very end. If you ever find yourself “updating” a secret value during rotation, you are off the rails.
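To make the label mechanics concrete, here is a toy in-memory model of the behaviour described above. It is a sketch, not the real service: the `SecretVersions` class and its methods are hypothetical names, and it only imitates how labels move between versions.

```python
class SecretVersions:
    """Toy model of staging labels; not the real Secrets Manager API."""

    def __init__(self):
        self.values = {}  # version_id -> secret value
        self.labels = {}  # staging label -> version_id (one version per label)

    def put(self, version_id, value, label):
        # Writing always creates a version; it never edits an existing one.
        self.values[version_id] = value
        self.labels[label] = version_id

    def get(self, label="AWSCURRENT"):
        return self.values[self.labels[label]]

    def promote(self, version_id):
        """Move AWSCURRENT to version_id; the old current becomes AWSPREVIOUS."""
        old = self.labels.get("AWSCURRENT")
        if old == version_id:
            return  # Already promoted, so a retry changes nothing.
        if old is not None:
            self.labels["AWSPREVIOUS"] = old
        self.labels["AWSCURRENT"] = version_id
        if self.labels.get("AWSPENDING") == version_id:
            del self.labels["AWSPENDING"]


store = SecretVersions()
store.put("v1", "old-password", "AWSCURRENT")
store.put("v2", "new-password", "AWSPENDING")
before = store.get()  # Default readers still resolve v1 while v2 is pending.
store.promote("v2")
after = store.get()   # After the relabel, default readers resolve v2.
```

Readers that ask for the default label never observe the pending value: `before` is the old password and `after` is the new one, with no intermediate state.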
2. The four-step rotation Lambda
Secrets Manager drives rotation by invoking your Lambda four times, passing a Step field in the event each time. Your function is a dispatcher on that field. The contract is fixed:
{
"SecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/app/db-AbCdEf",
"ClientRequestToken": "uuid-of-the-new-version",
"Step": "createSecret"
}
The ClientRequestToken is the version ID of the new (AWSPENDING) version. What each step must do:
| Step | Responsibility | Idempotency rule |
|---|---|---|
| createSecret | Generate the new credential and store it as AWSPENDING. | If AWSPENDING already exists, do nothing. |
| setSecret | Apply the pending credential to the backing service (e.g. ALTER USER). | Must tolerate being run twice. |
| testSecret | Connect/authenticate using the pending credential. | Read-only validation. |
| finishSecret | Move AWSCURRENT to the pending version. | Marks rotation complete. |
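The dispatcher itself can be small. The sketch below is one possible shape, assuming the client and the step functions are injected: `make_handler` and the `steps` mapping are hypothetical names, and the pre-checks are loosely modelled on the kind of validation a rotation function typically does before acting.

```python
def make_handler(service_client, steps):
    """Build a Lambda-style handler that dispatches on event["Step"].

    service_client: a Secrets Manager client, e.g. boto3.client("secretsmanager").
    steps: dict of step name -> callable(service_client, arn, token).
    """

    def handler(event, context):
        arn = event["SecretId"]
        token = event["ClientRequestToken"]
        step = event["Step"]

        # Refuse to act on a secret that is not set up for rotation.
        metadata = service_client.describe_secret(SecretId=arn)
        if not metadata.get("RotationEnabled"):
            raise ValueError(f"Secret {arn} is not enabled for rotation")

        # The token must name a version that is staged as AWSPENDING.
        stages = metadata.get("VersionIdsToStages", {}).get(token)
        if stages is None:
            raise ValueError(f"Version {token} is unknown for secret {arn}")
        if "AWSCURRENT" in stages:
            return  # This version was already promoted; nothing left to do.
        if "AWSPENDING" not in stages:
            raise ValueError(f"Version {token} is not staged as AWSPENDING")

        if step not in steps:
            raise ValueError(f"Invalid step: {step}")
        steps[step](service_client, arn, token)

    return handler
```

In a deployed function you would build the client once at module load and expose the returned handler as the Lambda entry point. Injecting both makes each step easy to exercise against a stub client.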
The most common bug is non-idempotent steps. Secrets Manager retries; if createSecret blindly generates a new password every invocation, you desync the stored secret from what setSecret applied. Always check for an existing AWSPENDING first.
import json


def create_secret(service_client, arn, token):
    # Source of truth is AWSCURRENT; we derive the new value from it.
    # This read also fails fast if the secret has no current version.
    current = service_client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")
    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # Pending already staged on a retry: do not regenerate.
        return
    except service_client.exceptions.ResourceNotFoundException:
        pass
    # Copy every field from the current value and replace only the password.
    secret = json.loads(current["SecretString"])
    secret["password"] = service_client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]
    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )
The finishSecret step is equally precise — it is a relabel, not a write. Find the version currently holding AWSCURRENT (returning early if it is already token, i.e. a retry), then move the label:
def finish_secret(service_client, arn, token, current_version):
    if current_version == token:
        return  # Already promoted (retry).
    service_client.update_secret_version_stage(
        SecretId=arn,
        VersionStage="AWSCURRENT",
        MoveToVersionId=token,
        RemoveFromVersionId=current_version,
    )
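The lookup of the version that currently holds AWSCURRENT can be sketched as a pure function over a DescribeSecret response. The name `find_current_version` is hypothetical; `VersionIdsToStages` is the field that maps each version ID to its labels.

```python
def find_current_version(metadata):
    """Return the version ID labelled AWSCURRENT, or None if there is none."""
    for version_id, stages in metadata.get("VersionIdsToStages", {}).items():
        if "AWSCURRENT" in stages:
            return version_id
    return None


# Example with a response shaped like describe_secret() output.
sample = {
    "VersionIdsToStages": {
        "1111-old": ["AWSPREVIOUS"],
        "2222-live": ["AWSCURRENT"],
        "3333-new": ["AWSPENDING"],
    }
}
current_version = find_current_version(sample)
```

The result is what you would pass as `current_version` when moving the label, so the promotion step stays a pure relabel.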
For most database engines you do not write this yourself — AWS publishes managed rotation functions as serverless application templates. Write a custom Lambda only for a target AWS does not cover.
3. RDS rotation: single-user vs alternating-user
For RDS, RDS Proxy, DocumentDB, and Redshift, you choose a rotation strategy, and the choice has real operational consequences.
Single-user rotation changes the password on the same database user every cycle. Simple, but hazardous: between setSecret (which runs ALTER USER ... PASSWORD) and the moment your application picks up the new AWSCURRENT, any connection attempt with the old password fails. If your app caches the secret and only refreshes on auth failure, you get a brief window of errors every rotation. Acceptable for low-traffic or fault-tolerant apps; not for a hot path.
Alternating-user rotation (the “multi-user” strategy) is the production-grade choice. It clones the existing user into a second account and alternates between the two each cycle. If cycle N leaves AWSCURRENT on app_user, cycle N+1 sets a new password on app_user_clone, tests it, and moves AWSCURRENT to it, while app_user and its password stay valid until the following cycle. The password change only ever touches the user that applications are not currently using, so there is no window where a credential an application already holds is invalid.
| Strategy | Users | Rotation-window failures | Setup requirement |
|---|---|---|---|
| Single-user | 1 | Possible (brief) | App user must be allowed to change its own password |
| Alternating-user | 2 | None if app respects AWSCURRENT | A superuser secret to clone the user |
Alternating-user requires a second secret — the master/superuser secret — referenced via masterarn, because cloning a user and granting its privileges needs elevated rights the application user does not have. The rotation Lambda authenticates with the master secret to provision the clone, then rotates the application secret.
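The alternation between the two users can be sketched as a small naming function. This is illustrative only: the `_clone` suffix follows the app_user / app_user_clone example above, and `alternate_username` is a hypothetical helper, not a documented API.

```python
CLONE_SUFFIX = "_clone"  # Assumed naming convention for the second user.


def alternate_username(current_username):
    """Return the user to rotate next, given the user behind AWSCURRENT."""
    if current_username.endswith(CLONE_SUFFIX):
        return current_username[: -len(CLONE_SUFFIX)]
    return current_username + CLONE_SUFFIX


next_user = alternate_username("app_user")
user_after_that = alternate_username(next_user)
```

Each cycle changes the password of the user returned here, which is never the one applications are connected as.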
# Enable alternating-user rotation on an RDS secret using the AWS-managed
# Postgres rotation function, every 30 days.
aws secretsmanager rotate-secret \
--secret-id prod/payments/db-app \
--rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRDSPostgreSQLRotationMultiUser \
--rotation-rules '{"ScheduleExpression": "rate(30 days)", "Duration": "2h"}'
The strategy is chosen by the rotation function, not by a flag in the secret, but the secret must carry the fields that function expects. A multi-user Postgres secret includes masterarn and looks like this:
{
"engine": "postgres",
"host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
"username": "app_user",
"password": "current-password",
"dbname": "payments",
"port": 5432,
"masterarn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-master-XyZ"
}
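A missing field tends to surface only when a rotation fails, so it can help to check the secret's shape before enabling rotation. The sketch below assumes the expected field sets shown in the constants, which are inferred from the example above; verify them against the rotation function you actually deploy. `missing_fields` is a hypothetical helper.

```python
import json

# Assumed field sets, inferred from the example secret above.
BASE_FIELDS = ("engine", "host", "username", "password")
MULTI_USER_FIELDS = BASE_FIELDS + ("masterarn",)


def missing_fields(secret_string, multi_user=True):
    """Return the expected keys that are absent from a secret's JSON value."""
    secret = json.loads(secret_string)
    expected = MULTI_USER_FIELDS if multi_user else BASE_FIELDS
    return [key for key in expected if key not in secret]


example = json.dumps({
    "engine": "postgres",
    "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
    "username": "app_user",
    "password": "current-password",
    "dbname": "payments",
    "port": 5432,
})
problems = missing_fields(example)  # The example omits masterarn on purpose.
```

An empty list means the secret has every field this check knows about; it does not prove the values are correct.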
Put RDS Proxy in front of this. The proxy keeps a warm connection pool and integrates natively with Secrets Manager, so when AWSCURRENT flips it picks up the new credential without each application instance re-reading the secret. Combined with alternating-user rotation, cutover becomes a non-event.
4. VPC, networking, and KMS permissions
This is where rotation breaks in practice, and the failure is almost always the same flavour: the Lambda times out with no useful error. The function has to reach two endpoints, the database and the Secrets Manager API, and both paths must be open.
If your database is in private subnets, the rotation Lambda must attach to the same VPC with security-group access to the DB port. But a VPC Lambda loses default internet egress and still needs to call Secrets Manager. Two options:
- A Secrets Manager interface VPC endpoint (com.amazonaws.<region>.secretsmanager) in the Lambda’s subnets: the correct answer for private workloads.
- A NAT gateway route: works, but pays egress and leaves the path public-ish.
resource "aws_vpc_endpoint" "secretsmanager" {
vpc_id = var.vpc_id
service_name = "com.amazonaws.${var.region}.secretsmanager"
vpc_endpoint_type = "Interface"
subnet_ids = var.private_subnet_ids
security_group_ids = [aws_security_group.sm_endpoint.id]
private_dns_enabled = true
}
# The rotation Lambda's SG must be allowed inbound on the DB port.
resource "aws_security_group_rule" "db_from_rotator" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_group_id = aws_security_group.rds.id
source_security_group_id = aws_security_group.rotation_lambda.id
}
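Because the usual symptom is an uninformative timeout, a short TCP probe at the start of the function can turn it into a clear log line. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper and the hosts you pass are your own.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable routes, DNS failures, timeouts.
        return False
```

Inside the rotation function you might probe the database host on its port and the regional Secrets Manager endpoint on 443, then raise a descriptive error naming whichever path is closed. A successful TCP connect only proves the network path; it says nothing about IAM or KMS permissions.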
For KMS: if the secret is encrypted with a customer-managed key (CMK) rather than the AWS-managed aws/secretsmanager key — and at scale it should be, for cross-account and auditability reasons — the rotation Lambda’s execution role needs kms:Decrypt and kms:GenerateDataKey on that key, and the key policy must permit Secrets Manager. The role also needs the standard rotation permissions plus secretsmanager:GetRandomPassword:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"secretsmanager:DescribeSecret",
"secretsmanager:GetSecretValue",
"secretsmanager:PutSecretValue",
"secretsmanager:UpdateSecretVersionStage"
],
"Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/*",
"Condition": {
"StringEquals": { "aws:ResourceTag/rotation": "managed" }
}
},
{
"Effect": "Allow",
"Action": "secretsmanager:GetRandomPassword",
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["kms:Decrypt", "kms:GenerateDataKey"],
"Resource": "arn:aws:kms:us-east-1:111122223333:key/<cmk-id>"
}
]
}
Secrets Manager invokes your function on your behalf, so the Lambda’s resource policy must grant lambda:InvokeFunction to secretsmanager.amazonaws.com. The AWS console wires this automatically; in IaC you add it explicitly:
resource "aws_lambda_permission" "allow_secretsmanager" {
statement_id = "AllowSecretsManagerInvoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.rotator.function_name
principal = "secretsmanager.amazonaws.com"
}
5. A custom rotator for non-RDS credentials
For third-party API keys, a Mongo Atlas user, or an internal service token, there is no managed function — you implement the four steps against the provider’s API. The shape is identical; only setSecret and testSecret change. The key discipline: many providers do not let you set a key value — they generate and return it. There, createSecret and setSecret collapse — you call the provider’s “create credential” API once, store the result as AWSPENDING, and make setSecret a near no-op. Clean up the old credential only in finishSecret, never before — deleting the old key before the new one is promoted is how you cause an outage.
import json

# provider_key_exists, provider_create_key and provider_authenticated_call are
# placeholders for your provider's SDK; they are not defined here.


def set_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Idempotent: only create at the provider if this pending key is not live yet.
    if not provider_key_exists(pending["api_key_id"]):
        provider_create_key(pending["api_key_id"], pending["api_key_secret"])


def test_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Any non-200 response fails the rotation before AWSCURRENT moves.
    resp = provider_authenticated_call(pending["api_key_secret"])
    if resp.status_code != 200:
        raise ValueError("Pending credential failed validation")
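For providers that mint the key themselves, the collapsed createSecret described above might look like the following sketch. It assumes a hypothetical `provider_create_key` callable that returns a `(key_id, key_secret)` pair; the field names match the ones used in the steps above.

```python
import json


def create_secret_generated(service_client, arn, token, provider_create_key):
    """createSecret for providers that generate and return the credential."""
    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        return  # Pending version already stored; do not mint a second key.
    except service_client.exceptions.ResourceNotFoundException:
        pass

    # One call to the provider; its answer becomes the pending version.
    key_id, key_secret = provider_create_key()
    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps({"api_key_id": key_id, "api_key_secret": key_secret}),
        VersionStages=["AWSPENDING"],
    )
```

With this shape setSecret has nothing left to apply. One gap remains: if the provider call succeeds and the write to Secrets Manager fails, the provider holds a key that no version references, so some form of orphan cleanup is worth planning for.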
Deploy and schedule a custom rotator the same way, pointing --rotation-lambda-arn at your function. Validate one step at a time before enabling the schedule: a RotateSecret call (aws secretsmanager rotate-secret) starts a full rotation immediately, running all four steps, and you do not want to discover a broken finishSecret in production.
6. Cross-account sharing
A central account often owns secrets that spoke-account workloads consume, such as a shared database credential or a partner API key. Two things must both be true or the consumer's calls are denied: a resource policy on the secret, and KMS permissions on the encrypting key.
The resource policy grants the consumer account access to the secret:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::444455556666:root" },
"Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
"Resource": "*"
}
]
}
The trap: you cannot use the AWS-managed aws/secretsmanager key for cross-account access, because its key policy cannot be edited to permit another account. Encrypt cross-account secrets with a customer managed key and grant the consumer account kms:Decrypt in the key policy:
{
"Sid": "AllowConsumerDecrypt",
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::444455556666:root" },
"Action": "kms:Decrypt",
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
}
}
}
On the consumer side, the IAM principal still needs its own identity-based policy allowing secretsmanager:GetSecretValue and kms:Decrypt against those ARNs — resource policy and identity policy must both allow, since each account authorizes independently. Rotation keeps running in the owning account; consumers always resolve AWSCURRENT and so follow rotation automatically.
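Since the two owner-side documents must name the same consumer account and region, generating them together avoids drift. The sketch below simply reproduces the two policies shown above from parameters; `cross_account_statements` is a hypothetical helper.

```python
import json


def cross_account_statements(consumer_account_id, region):
    """Return (secret resource policy, KMS key policy statement) as dicts."""
    principal = {"AWS": f"arn:aws:iam::{consumer_account_id}:root"}
    secret_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
                "Resource": "*",
            }
        ],
    }
    kms_statement = {
        "Sid": "AllowConsumerDecrypt",
        "Effect": "Allow",
        "Principal": principal,
        "Action": "kms:Decrypt",
        "Resource": "*",
        # Decrypt is allowed only when the request comes via Secrets Manager.
        "Condition": {
            "StringEquals": {"kms:ViaService": f"secretsmanager.{region}.amazonaws.com"}
        },
    }
    return secret_policy, kms_statement


secret_policy, kms_statement = cross_account_statements("444455556666", "us-east-1")
secret_policy_json = json.dumps(secret_policy, indent=2)
```

The first document is attached to the secret as its resource policy; the second is one statement to merge into the existing key policy, not a complete key policy on its own.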
7. Application integration without rotation-window failures
The cache makes or breaks the consumer experience. Calling GetSecretValue on every request is slow and hits API throttling. The right pattern is the AWS caching client (Java, Python, Go, .NET, or the Lambda extension), which caches AWSCURRENT in memory and refreshes on an interval.
import json

import boto3
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

client = boto3.client("secretsmanager")
cache = SecretCache(config=SecretCacheConfig(secret_refresh_interval=3600), client=client)


def get_connection():
    secret = json.loads(cache.get_secret_string("prod/payments/db-app"))
    # connect() stands in for your database driver's connect call.
    return connect(secret)  # On auth failure, force-refresh and retry once.
Two rules keep you safe across rotations:
- Refresh on auth failure, not only on a timer. If a DB connection is rejected, invalidate the cache entry, re-read AWSCURRENT, and retry once. With alternating-user rotation this almost never fires, but it is your safety net for single-user.
- Never pin a VersionId. Read AWSCURRENT (the default). Pinning a version means you never follow rotation, which is the cardinal sin.
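The first rule can be sketched independently of any particular cache. In this sketch `load_secret` and `connect` are hypothetical callables you supply: `load_secret(force_refresh)` returns the secret as a dict, bypassing the cache when `force_refresh` is true, and `connect(secret)` raises `auth_error` when the credential is rejected.

```python
def connect_with_refresh(load_secret, connect, auth_error=Exception):
    """Connect with the cached secret; on auth failure, refresh and retry once."""
    secret = load_secret(False)  # Normal path: whatever the cache holds.
    try:
        return connect(secret)
    except auth_error:
        # The cached value may predate a rotation: re-read AWSCURRENT once.
        secret = load_secret(True)
        return connect(secret)  # A second failure propagates to the caller.
```

Retrying exactly once matters: a credential that is still rejected after a fresh read points to a real problem, and looping would only add load while hiding it. Pass the narrowest exception type your driver raises for authentication failures rather than the broad default.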
The Lambda extension is cleanest for serverless consumers: it runs a local HTTP cache with built-in TTL, so your function makes a localhost call instead of hitting the Secrets Manager API.
8. Monitoring rotation health and staleness
Rotation that fails silently is worse than no rotation, because you believe you are covered. Wire three signals:
- RotationFailed: Secrets Manager emits this event on failure; route it through EventBridge to SNS.
- RotationSucceeded: track it to detect absence. A secret that has not rotated on schedule is the real risk, visible only by alarming on staleness.
- Last-rotated age: a scheduled check comparing LastRotatedDate against the schedule, alerting on overdue secrets.
{
"source": ["aws.secretsmanager"],
"detail-type": ["AWS Service Event via CloudTrail"],
"detail": {
"eventName": ["RotationFailed"]
}
}
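The last-rotated check from the third signal reduces to a date comparison. The sketch below assumes you pass the maximum acceptable age yourself instead of parsing the schedule, and that `LastRotatedDate` arrives as a timezone-aware datetime, as in a DescribeSecret response from boto3. `is_overdue` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(metadata, max_age, now=None):
    """Return True if the secret was never rotated or was rotated too long ago."""
    last_rotated = metadata.get("LastRotatedDate")
    if last_rotated is None:
        return True  # Never rotated counts as stale.
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_rotated > max_age


# Example: a 30-day schedule with two days of slack.
check_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = {"LastRotatedDate": datetime(2024, 5, 20, tzinfo=timezone.utc)}
stale = {"LastRotatedDate": datetime(2024, 4, 1, tzinfo=timezone.utc)}
fresh_overdue = is_overdue(fresh, timedelta(days=32), now=check_time)
stale_overdue = is_overdue(stale, timedelta(days=32), now=check_time)
```

A little slack above the schedule avoids alerting on a rotation that is merely running inside its window.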
For staleness, AWS Config is the durable approach. Its managed rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation flag secrets without rotation enabled or not rotated within a maximum age — far better than a custom script you will forget to maintain.
Enterprise scenario
A payments platform team ran 40+ microservices against an Aurora PostgreSQL cluster, all sharing one application credential rotated single-user every 7 days. Every rotation produced a 20-to-90 second spike of 500s as services hit auth failures and slowly refreshed their caches. The on-call playbook literally said “ignore the Thursday 02:00 error spike” — which is exactly how a real incident later got ignored for eleven minutes.
The constraints were hard: they could not coordinate a synchronized restart of 40 services, and the payments hot path’s error budget was being burned by the rotation spike alone. The fix had two moves. First, they switched the secret to alternating-user rotation with a master secret, so the promoted credential always belonged to a user that already existed with a valid password — eliminating the invalid-credential window. Second, they put RDS Proxy in front of the cluster so cutover happened once, at the proxy, instead of independently in 40 connection pools.
aws rds create-db-proxy \
--db-proxy-name payments-proxy \
--engine-family POSTGRESQL \
--auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-app-XyZ","IAMAuth":"DISABLED"}]' \
--role-arn arn:aws:iam::111122223333:role/payments-proxy-role \
--vpc-subnet-ids subnet-aaa subnet-bbb
After the change, rotation-window errors went to zero across a quarter of weekly rotations, and the “ignore the Thursday spike” line was deleted from the runbook — the real win, because a runbook that trains operators to ignore alerts is a latent incident.
Verify
Confirm rotation actually works before trusting it:
# Trigger an immediate rotation cycle (runs all four steps now).
aws secretsmanager rotate-secret --secret-id prod/payments/db-app
# Inspect version-to-stage mapping; AWSCURRENT should move to a new version,
# and AWSPREVIOUS should appear after a successful cycle.
aws secretsmanager describe-secret --secret-id prod/payments/db-app \
--query '{Rotated:LastRotatedDate, Schedule:RotationRules, Versions:VersionIdsToStages}'
# Resolve the live credential. This prints only the username; to confirm it
# authenticates, open a real connection using the full value.
aws secretsmanager get-secret-value --secret-id prod/payments/db-app \
--query SecretString --output text | jq -r '.username'
Check that LastRotatedDate advanced, that exactly one version holds AWSCURRENT, and that no AWSPENDING lingers (a stuck AWSPENDING means a step failed mid-cycle). Then pull the rotation Lambda’s CloudWatch logs and confirm all four steps logged success.
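The label checks above can be automated against the describe-secret output. This is a sketch under one assumption worth stating: it treats AWSPENDING as stuck only when it sits on a version other than the one holding AWSCURRENT. `rotation_problems` is a hypothetical helper.

```python
def rotation_problems(metadata):
    """Return human-readable problems found in a DescribeSecret response."""
    stages = metadata.get("VersionIdsToStages", {})
    problems = []

    current = [v for v, labels in stages.items() if "AWSCURRENT" in labels]
    if len(current) != 1:
        problems.append(f"expected exactly one AWSCURRENT version, found {len(current)}")

    # A pending label left on a non-current version means a step failed mid-cycle.
    stuck = [
        v for v, labels in stages.items()
        if "AWSPENDING" in labels and "AWSCURRENT" not in labels
    ]
    if stuck:
        problems.append(f"AWSPENDING still attached to: {', '.join(sorted(stuck))}")
    return problems


healthy = {"VersionIdsToStages": {"v2": ["AWSCURRENT"], "v1": ["AWSPREVIOUS"]}}
stuck_cycle = {"VersionIdsToStages": {"v2": ["AWSCURRENT"], "v3": ["AWSPENDING"]}}
healthy_result = rotation_problems(healthy)
stuck_result = rotation_problems(stuck_cycle)
```

An empty list only says the labels are consistent; the CloudWatch logs remain the place to confirm that each step actually ran.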