
Secrets Manager Rotation at Scale: Custom Rotation Lambdas, RDS Credentials, and Cross-Account Sharing

Storing a secret in AWS Secrets Manager is the easy part. The thing that actually reduces blast radius is rotation — and rotation is where teams quietly fail, because a botched cycle either locks an application out of its own database or, worse, succeeds silently while applications keep using a credential that no longer works. The mechanics are precise: one staging-label state machine you must respect, a versioning model you cannot shortcut, and networking/KMS permissions that are unforgiving when wrong. Get one of those three wrong and the failure mode is the same flavour every time — a Lambda that times out with no useful error, a secret stuck in AWSPENDING, and a 02:00 pager.

This guide walks the four-step rotation model end to end, contrasts single-user and alternating-user RDS strategies, shows how to write a custom rotator for non-RDS credentials, and covers the cross-account sharing platform teams always end up needing. It is built as a reference you keep open mid-incident: read the prose once, then keep the tables — the step contract, the IAM grants, the VPC/KMS requirements, the failure playbook — open at 02:00. Every operation gets both the aws CLI and the IaC (Terraform) form, because the console wires permissions you will forget to declare in code.

By the end you will stop guessing. When rotation fails you will know whether it is a non-idempotent step, a missing Secrets Manager VPC endpoint, a master secret that cannot clone the user, an aws/secretsmanager-encrypted secret that can never go cross-account, or a consumer whose identity policy was never granted even though the resource policy was. Knowing which in ninety seconds is what separates a five-minute incident from a two-hour one.

What problem this solves

A long-lived database password is a standing liability. It leaks through a log line, a .env committed by accident, a laptop image, an offboarded contractor who still remembers it. The only durable mitigation is to make the credential short-lived — rotate it on a schedule so that any copy an attacker holds expires on its own. Secrets Manager automates that, but the automation has sharp edges, and the edges are exactly where production breaks.

What breaks without getting this right: an engineer enables single-user rotation on a hot-path credential and every cycle produces a spike of authentication errors as forty services slowly notice the password changed; a rotation Lambda is dropped into the database VPC but never given a Secrets Manager endpoint, so it times out reaching the API and the secret sticks half-rotated in AWSPENDING; a central platform team shares a secret cross-account using the AWS-managed key and the consumer account gets AccessDenied on kms:Decrypt forever, because that key’s policy cannot be edited; a finishSecret step is non-idempotent, Secrets Manager retries it, and the staging labels desync. Each of these is perfectly diagnosable and each costs a team an afternoon the first time.

Who hits this: anyone running RDS/Aurora, DocumentDB, Redshift, RDS Proxy, or any credentialed third-party service at scale. It bites hardest on multi-account organisations (cross-account KMS is non-obvious), VPC-isolated databases (the networking is the silent killer), high-traffic hot paths (single-user rotation windows show up as real error budget), and custom integrations where AWS publishes no managed rotation function and you must write all four steps yourself.

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the one place to look first:

| Failure class | What you observe | First question to ask | First place to look | Most common single cause |
| --- | --- | --- | --- | --- |
| Rotation Lambda timeout | RotationFailed, stuck AWSPENDING | Can the Lambda reach BOTH the DB and the SM API? | CloudWatch Logs for the rotator | No Secrets Manager VPC endpoint in private subnets |
| Rotation-window auth errors | 5xx spike every cycle | Single-user or alternating-user? | VersionIdsToStages + app logs | Single-user on a hot path |
| setSecret permission denied | Logs stop at step 2 | Can the function clone / ALTER USER? | Rotator logs, DB grants | Missing masterarn (alternating-user) |
| Cross-account AccessDenied | Consumer can’t GetSecretValue | Resource policy AND KMS AND identity policy all allow? | simulate-principal-policy | aws/secretsmanager key (uneditable) cross-account |
| Stale / never-rotated secret | Believed covered, never rotated | Is rotation even enabled and succeeding? | AWS Config rules, LastRotatedDate | Rotation disabled or silently failing |
| App uses dead credential | Auth fails after a clean rotation | Is the app pinned to a VersionId? | App config / cache code | Pinned version instead of AWSCURRENT |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the Secrets Manager basics — that a secret is a named, versioned, KMS-encrypted blob you read with GetSecretValue, billed per secret per month plus per 10,000 API calls. You should be comfortable with IAM identity-based vs resource-based policies, KMS key policies, and the difference between an AWS-managed key (aws/secretsmanager) and a customer-managed key (CMK). Basic VPC literacy (private subnets, security groups, interface endpoints) and the ability to run aws CLI v2 and read JSON output are assumed.

This sits at the intersection of the Security and Databases tracks. It builds directly on the AWS Secrets Manager & Parameter Store Deep Dive (storage, naming, versioning fundamentals) and the AWS KMS Encryption Deep Dive (key policies, grants, envelope encryption — the cross-account half of this article is a KMS problem). On the database side it pairs with the AWS RDS & Aurora Deep Dive and especially RDS Proxy: Connection Pooling, Failover, IAM Auth, because RDS Proxy is what turns rotation cutover into a non-event. The cross-account sharing pattern leans on IAM Cross-Account Roles, External ID, Confused Deputy and Multi-Region KMS Keys & Envelope Encryption. The rotation engine itself is a Lambda running inside the database VPC.

A quick map of who owns what during a rotation incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Secret + staging labels | The versioned value, AWSCURRENT/AWSPENDING | App / platform | Stuck pending, desynced labels |
| Rotation Lambda | The four-step state machine | App / platform | Non-idempotent step, timeout |
| VPC / SG / endpoints | Network path to DB and SM API | Network team | Lambda timeout (no route) |
| IAM role + resource policy | Who may rotate / invoke | Security / platform | Permission denied mid-cycle |
| KMS key + key policy | Encryption + cross-account decrypt | Security / KMS admin | Cross-account AccessDenied |
| RDS / Aurora | The backing user(s), master secret | DBA / platform | ALTER USER fails, no clone |
| Consumer app | Cache, refresh, version pinning | Consuming team | Uses dead credential after rotation |

Core concepts

Five mental models make every later diagnosis obvious.

Rotation is a relabel, never an in-place edit. A secret holds one or more versions, each carrying one or more staging labels. Rotation creates a new version, parks it under AWSPENDING, proves it works, then atomically moves AWSCURRENT onto it. Applications reading AWSCURRENT (the default) never see the pending value until cutover completes — that is the entire reason rotation can be zero-downtime. If you ever find yourself “updating” a secret value during rotation, you are off the rails.

Secrets Manager drives your Lambda four times. It invokes your function once per step (createSecret, setSecret, testSecret, finishSecret), passing the step name and the new version’s ID (ClientRequestToken). Your function is a dispatcher on that field. Secrets Manager retries on failure, so every step must be idempotent — running twice must be safe.

RDS rotation has two strategies, and the choice is operational, not cosmetic. Single-user changes the password on the same user every cycle and has a brief window where the old credential is invalid. Alternating-user keeps two users and swaps between them, so the promoted credential always belonged to a user that already existed with a known-good password — no invalid window. The alternating strategy needs a second, elevated master secret to clone the user.

The rotation Lambda must reach two endpoints, and a VPC Lambda loses easy egress. It must reach the database (security-group access on the DB port) and the Secrets Manager API. A Lambda attached to a VPC has no default internet egress, so it needs either a Secrets Manager interface VPC endpoint or a NAT path to call the API. Forgetting this is the single most common rotation failure, and it presents as a bare timeout.

Cross-account sharing requires three independent allows. The owning account’s secret resource policy must grant the consumer; the KMS key policy must permit the consumer to Decrypt (and the key must be a CMK, because the AWS-managed key cannot be shared); and the consumer’s own identity-based policy must allow GetSecretValue and kms:Decrypt. All three must hold or the read fails with AccessDenied, because each account authorises independently.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to rotation |
| --- | --- | --- | --- |
| Version | An immutable copy of the secret value | The secret | Rotation writes a new one each cycle |
| Staging label | A movable pointer to one version | On a version | AWSCURRENT/AWSPENDING drive cutover |
| AWSCURRENT | The version apps get by default | One version | What consumers should always read |
| AWSPENDING | The in-flight version being tested | One version (transient) | Lingering = a step failed mid-cycle |
| AWSPREVIOUS | The version that was current | One version | One-generation rollback |
| Rotation Lambda | The four-step state machine | Lambda (often in VPC) | The engine; must be idempotent |
| Step | Which of the four phases is running | Invocation event | Your dispatcher switches on it |
| masterarn | Pointer to the elevated master secret | Secret JSON | Required for alternating-user |
| Single-user | Rotate the same DB user | Rotation strategy | Brief invalid-credential window |
| Alternating-user | Swap between two DB users | Rotation strategy | No invalid window (production-grade) |
| SM VPC endpoint | Interface endpoint to the SM API | Private subnets | Lets a VPC Lambda call the API |
| Resource policy | Who else may read this secret | On the secret | Cross-account / cross-principal grant |
| CMK | Customer-managed KMS key | KMS | The only key shareable cross-account |

1. The versioning and staging-label model

Everything in rotation hinges on staging labels. A secret holds one or more versions, each carrying labels. Three are reserved and load-bearing:

Each staging label points to exactly one version at a time. Rotation is fundamentally the act of creating a new version labelled AWSPENDING, proving it works, then atomically moving AWSCURRENT onto it. Applications that always read AWSCURRENT (the default) never see the pending value until cutover completes. That is the whole reason rotation can be zero-downtime: the new credential is fully provisioned and tested before anything points production at it.

Mental model: rotation never edits a secret in place. It writes a new version, parks it under AWSPENDING, and only relabels AWSCURRENT at the very end. If you ever find yourself “updating” a secret value during rotation, you are off the rails.
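To make the relabel idea concrete, here is a toy in-memory model of one secret. It is a sketch for intuition only: the `SecretModel` class and its method names are hypothetical and are not part of any AWS SDK, and the automatic AWSPREVIOUS move is simulated by hand.

```python
class SecretModel:
    """Toy in-memory model of one secret's versions and staging labels."""

    def __init__(self, version_id, value):
        self.values = {version_id: value}         # version ID -> immutable value
        self.stages = {"AWSCURRENT": version_id}  # staging label -> version ID

    def put_pending(self, version_id, value):
        # Idempotent: a retry with the same token must not overwrite the value.
        if version_id not in self.values:
            self.values[version_id] = value
        self.stages["AWSPENDING"] = version_id

    def promote(self, version_id):
        # Cutover is a relabel only; no stored value is rewritten.
        if self.stages["AWSCURRENT"] == version_id:
            return  # already promoted (retry)
        self.stages["AWSPREVIOUS"] = self.stages["AWSCURRENT"]
        self.stages["AWSCURRENT"] = version_id
        self.stages.pop("AWSPENDING", None)

    def read(self, stage="AWSCURRENT"):
        return self.values[self.stages[stage]]


secret = SecretModel("v1", "old-password")
secret.put_pending("v2", "new-password")
before_cutover = secret.read()               # callers still see the old value
secret.promote("v2")
after_cutover = secret.read()                # callers now see the new value
rollback_value = secret.read("AWSPREVIOUS")  # one generation back
```

Note that `promote` never touches `values`: the old password is still stored under its own version, which is what makes a one-generation rollback possible.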

The three reserved labels, what moves them, and what each means for callers:

| Staging label | Set by | Points to | Exists between rotations? | What a caller gets |
| --- | --- | --- | --- | --- |
| AWSCURRENT | finishSecret (relabel) | The live, in-use version | Yes (always exactly one) | Default value from GetSecretValue |
| AWSPENDING | createSecret (PutSecretValue) | The new version under test | No (transient) | Only if explicitly requested by stage |
| AWSPREVIOUS | Auto, when AWSCURRENT moves | The prior current version | Yes after first rotation | One generation back, for rollback |
| Custom label | You, via UpdateSecretVersionStage | Any version you choose | Yes if you set it | Whatever you pin it to (avoid for apps) |

The version-stage invariants you must never violate — break one and rotation desyncs:

| Invariant | Why it holds | What violating it causes |
| --- | --- | --- |
| Exactly one version holds AWSCURRENT | Callers need one unambiguous live value | Apps read an indeterminate credential |
| AWSPENDING is removed once promoted | It only labels the in-flight version | Lingering pending blocks the next cycle |
| A version is immutable once written | Each version is identified by its version ID (the token) | “Editing” forks state from what was applied |
| AWSPREVIOUS is one generation only | Rollback target, not history | Treating it as an archive loses older values |
| Labels move atomically | Cutover must be all-or-nothing | Partial moves expose a half-rotated secret |

Inspect the live label-to-version mapping at any time — this is your first diagnostic command in any rotation incident:

aws secretsmanager describe-secret --secret-id prod/payments/db-app \
  --query 'VersionIdsToStages'
# Healthy: one version → ["AWSCURRENT"], one → ["AWSPREVIOUS"], NO "AWSPENDING".
# A lingering "AWSPENDING" means a step failed mid-cycle.
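The same check can be scripted so it runs identically during every incident. The sketch below takes the `VersionIdsToStages` mapping (for example, the parsed JSON output of the command above) and reports anything unhealthy; the function name `diagnose_stages` and the wording of the findings are illustrative choices, not part of any AWS tooling.

```python
def diagnose_stages(version_ids_to_stages):
    """Return human-readable findings for a VersionIdsToStages mapping."""
    findings = []
    current = [v for v, s in version_ids_to_stages.items() if "AWSCURRENT" in s]
    pending = [v for v, s in version_ids_to_stages.items() if "AWSPENDING" in s]

    if len(current) != 1:
        findings.append(
            f"expected exactly one AWSCURRENT version, found {len(current)}"
        )
    for version_id in pending:
        # A pending label on a version that is not also current means a
        # rotation cycle started and never finished.
        if version_id not in current:
            findings.append(
                f"version {version_id} is stuck in AWSPENDING: "
                "a rotation step failed mid-cycle"
            )
    return findings


healthy = {"v2": ["AWSCURRENT"], "v1": ["AWSPREVIOUS"]}
stuck = {"v2": ["AWSCURRENT"], "v1": ["AWSPREVIOUS"], "v3": ["AWSPENDING"]}

for name, mapping in (("healthy", healthy), ("stuck", stuck)):
    print(name, diagnose_stages(mapping) or "OK")
```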

2. The four-step rotation Lambda

Secrets Manager drives rotation by invoking your Lambda four times, passing a Step field in the event each time. Your function is a dispatcher on that field. The contract is fixed:

{
  "SecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/app/db-AbCdEf",
  "ClientRequestToken": "uuid-of-the-new-version",
  "Step": "createSecret"
}

The ClientRequestToken is the version ID of the new (AWSPENDING) version. What each step must do, and the idempotency rule that keeps retries safe:

| Step | Responsibility | Idempotency rule | Failure if you skip it |
| --- | --- | --- | --- |
| createSecret | Generate the new credential and store it as AWSPENDING. | If AWSPENDING already exists, do nothing. | Each retry regenerates → desync vs setSecret |
| setSecret | Apply the pending credential to the backing service (e.g. ALTER USER). | Must tolerate being run twice. | Double-apply or partial apply on retry |
| testSecret | Connect/authenticate using the pending credential. | Read-only validation. | A broken credential gets promoted |
| finishSecret | Move AWSCURRENT to the pending version. | Return early if already promoted. | Cutover repeats or strands the label |

The four steps as a state machine — what is true before and after each, and who acts:

| Phase | Before | Action | After | Actor |
| --- | --- | --- | --- | --- |
| Start | One version: AWSCURRENT | SM allocates a new version ID | Token reserved for AWSPENDING | Secrets Manager |
| createSecret | No AWSPENDING | Generate value, PutSecretValue | New version labelled AWSPENDING | Your Lambda |
| setSecret | Pending exists, not yet live in DB | Apply to backing service | DB accepts the pending credential | Your Lambda |
| testSecret | Pending applied | Authenticate with pending | Pending proven good | Your Lambda |
| finishSecret | Pending good | Move AWSCURRENT → pending | Pending becomes current; old → previous | Your Lambda |

The most common bug is non-idempotent steps. Secrets Manager retries; if createSecret blindly generates a new password every invocation, you desync the stored secret from what setSecret applied. Always check for an existing AWSPENDING first.

import json


def create_secret(service_client, arn, token):
    # Source of truth is AWSCURRENT; we derive the new value from it.
    current = service_client.get_secret_value(SecretId=arn, VersionStage="AWSCURRENT")

    try:
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        # Pending already staged on a retry — do not regenerate.
        return
    except service_client.exceptions.ResourceNotFoundException:
        pass

    secret = json.loads(current["SecretString"])
    secret["password"] = service_client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]

    service_client.put_secret_value(
        SecretId=arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )

The finishSecret step is equally precise — it is a relabel, not a write. Find the version currently holding AWSCURRENT (returning early if it is already token, i.e. a retry), then move the label:

def finish_secret(service_client, arn, token, current_version):
    if current_version == token:
        return  # Already promoted (retry).
    service_client.update_secret_version_stage(
        SecretId=arn,
        VersionStage="AWSCURRENT",
        MoveToVersionId=token,
        RemoveFromVersionId=current_version,
    )

For most database engines you do not write this yourself — AWS publishes managed rotation functions as serverless application templates. The managed functions you should reach for before writing custom code:

| Backing service | Managed function family | Strategies offered | Write custom instead? |
| --- | --- | --- | --- |
| RDS/Aurora PostgreSQL | SecretsManagerRDSPostgreSQLRotation… | Single-user, MultiUser | No, use managed |
| RDS/Aurora MySQL/MariaDB | SecretsManagerRDSMySQLRotation… | Single-user, MultiUser | No, use managed |
| RDS SQL Server | SecretsManagerRDSSQLServerRotation… | Single-user, MultiUser | No, use managed |
| RDS Oracle | SecretsManagerRDSOracleRotation… | Single-user, MultiUser | No, use managed |
| Amazon Redshift | SecretsManagerRedshiftRotation… | Single-user, MultiUser | No, use managed |
| DocumentDB | SecretsManagerMongoDBRotation… | Single-user, MultiUser | No, use managed |
| Generic / third-party API | SecretsManagerRotationTemplate | You implement | Yes, start from the template |
| OS / SSH / non-API target | (none) | | Yes, full custom |

The dispatcher skeleton that ties the four steps together — note the explicit guard that rotation is even enabled and that the version is pending:

import boto3


def lambda_handler(event, context):
    arn, token, step = event["SecretId"], event["ClientRequestToken"], event["Step"]
    client = boto3.client("secretsmanager")

    md = client.describe_secret(SecretId=arn)
    if not md.get("RotationEnabled"):
        raise ValueError(f"Rotation not enabled for {arn}")
    versions = md["VersionIdsToStages"]
    if token not in versions:
        raise ValueError(f"Version {token} has no stage for {arn}")
    if "AWSCURRENT" in versions[token]:
        return  # Already current: nothing to do.
    if "AWSPENDING" not in versions[token]:
        raise ValueError(f"Version {token} is not AWSPENDING for {arn}")

    # Exactly one version holds AWSCURRENT; finishSecret moves the label off it.
    current_version = next(v for v, s in versions.items() if "AWSCURRENT" in s)
    handlers = {
        "createSecret": lambda: create_secret(client, arn, token),
        "setSecret":    lambda: set_secret(client, arn, token),
        "testSecret":   lambda: test_secret(client, arn, token),
        "finishSecret": lambda: finish_secret(client, arn, token, current_version),
    }
    if step not in handlers:
        raise ValueError(f"Unknown step {step!r}")
    handlers[step]()

3. RDS rotation: single-user vs alternating-user

For RDS, RDS Proxy, DocumentDB, and Redshift, you choose a rotation strategy, and the choice has real operational consequences.

Single-user rotation changes the password on the same database user every cycle. Simple, but hazardous: between setSecret (which runs ALTER USER ... PASSWORD) and the moment your application picks up the new AWSCURRENT, any connection attempt with the old password fails. If your app caches the secret and only refreshes on auth failure, you get a brief window of errors every rotation. Acceptable for low-traffic or fault-tolerant apps; not for a hot path.

Alternating-user rotation (the “multi-user” strategy) is the production-grade choice. It clones the existing user into a second account and alternates between the two each cycle. Cycle N uses app_user; cycle N+1 rotates app_user_clone, swaps AWSCURRENT to it, and leaves the previous user fully valid. Because the newly promoted credential belongs to a user that already existed with a known-good password, there is no window where the current credential is invalid.

| Strategy | Users | Rotation-window failures | Setup requirement | Best for |
| --- | --- | --- | --- | --- |
| Single-user | 1 | Possible (brief) | App user can ALTER itself | Low-traffic, fault-tolerant apps |
| Alternating-user | 2 | None if app respects AWSCURRENT | A superuser secret to clone the user | Hot paths, fleets, strict SLAs |

A fuller side-by-side, because the trade-offs go beyond the failure window:

| Dimension | Single-user | Alternating-user |
| --- | --- | --- |
| DB users required | 1 | 2 (original + clone) |
| Master/superuser secret | Not required | Required (masterarn) |
| Invalid-credential window | Yes (until cache refresh) | None |
| Privileges the rotator needs | ALTER USER on self | CREATE/ALTER USER, clone grants |
| Behaviour on app cache lag | Auth errors until refresh | Old user still valid → no errors |
| Rollback safety | One generation | One generation + prior user live |
| Operational complexity | Lower | Higher (two users to reason about) |
| Recommended for hot paths | No | Yes |

Alternating-user requires a second secret — the master/superuser secret — referenced via masterarn, because cloning a user and granting its privileges needs elevated rights the application user does not have. The rotation Lambda authenticates with the master secret to provision the clone, then rotates the application secret.
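The alternation itself is simple bookkeeping: each cycle targets whichever of the two users is not currently live. A minimal sketch, assuming the `app_user` / `app_user_clone` naming used above and a 63-character identifier limit (the PostgreSQL limit; other engines differ). The function name `alternate_username` is hypothetical, and a managed rotation function may implement this differently.

```python
CLONE_SUFFIX = "_clone"


def alternate_username(current_username, max_length=63):
    """Return the user the next rotation cycle should target.

    app_user -> app_user_clone -> app_user -> ...
    max_length is an assumed engine limit on identifier length.
    """
    if current_username.endswith(CLONE_SUFFIX):
        return current_username[: -len(CLONE_SUFFIX)]
    candidate = current_username + CLONE_SUFFIX
    if len(candidate) > max_length:
        raise ValueError(
            f"{candidate!r} exceeds the {max_length}-character identifier limit"
        )
    return candidate


user = "app_user"
for cycle in range(3):
    user = alternate_username(user)
    print(f"cycle {cycle + 1}: rotate and promote {user}")
```

Because the user being rotated is never the one holding AWSCURRENT, its password can change without any live connection noticing.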

# Enable alternating-user rotation on an RDS secret using the AWS-managed
# Postgres rotation function, every 30 days.
aws secretsmanager rotate-secret \
  --secret-id prod/payments/db-app \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRDSPostgreSQLRotationMultiUser \
  --rotation-rules '{"ScheduleExpression": "rate(30 days)", "Duration": "2h"}'

The application’s secret must carry the fields that the chosen strategy’s rotation function expects. A multi-user Postgres secret looks like this; every field below is load-bearing for the managed function:

{
  "engine": "postgres",
  "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
  "username": "app_user",
  "password": "current-password",
  "dbname": "payments",
  "port": 5432,
  "masterarn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-master-XyZ"
}

The required and optional keys in an RDS secret JSON, and what the managed function does with each:

| Key | Required? | Used by | Notes / gotcha |
| --- | --- | --- | --- |
| engine | Yes | Function (driver select) | postgres, mysql, oracle, sqlserver, etc. |
| host | Yes | Connect | Cluster endpoint; use the writer for DDL |
| port | Yes | Connect | 5432 PG, 3306 MySQL, 1521 Oracle, 1433 MSSQL |
| username | Yes | ALTER/auth | The rotating app user |
| password | Yes | Auth | The function overwrites this each cycle |
| dbname | Often | Connect | Some engines require it to authenticate |
| masterarn | Alternating only | Clone the user | Required by the multi-user function; the strategy itself is set by which function you attach |
| dbInstanceIdentifier | Optional | Resolve host | Alternative to host for some templates |
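A missing key usually surfaces as an opaque failure inside the rotation function, so it can be worth validating the secret JSON before enabling rotation. The sketch below encodes the table above as a pre-flight check; `validate_rds_secret` and the `REQUIRED_KEYS` tuple are illustrative names, and the required set simply mirrors this article's table rather than any one managed template.

```python
import json

# Keys the table above marks as required for every strategy (assumption:
# taken from this article, not from a specific managed template).
REQUIRED_KEYS = ("engine", "host", "port", "username", "password")


def validate_rds_secret(secret_string, alternating_user):
    """Return a list of problems found in an RDS secret's JSON body."""
    secret = json.loads(secret_string)
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if key not in secret
    ]
    if alternating_user and "masterarn" not in secret:
        problems.append("alternating-user rotation needs masterarn")
    return problems


example = json.dumps({
    "engine": "postgres",
    "host": "payments.cluster-abc123.us-east-1.rds.amazonaws.com",
    "username": "app_user",
    "password": "current-password",
    "dbname": "payments",
    "port": 5432,
})
print(validate_rds_secret(example, alternating_user=True))
```

Here the example deliberately omits `masterarn`, so the check reports it when `alternating_user=True` and passes when the secret is meant for single-user rotation.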

The schedule controls in --rotation-rules, and how each behaves:

| Field | Meaning | Example | Notes |
| --- | --- | --- | --- |
| ScheduleExpression | When rotation fires | rate(30 days) / cron(0 3 1 * ? *) | rate() takes hours or days; the shortest supported interval is 4 hours |
| Duration | Rotation window length | "2h" | SM starts within this window, not exactly on the dot |
| AutomaticallyAfterDays | Legacy interval (days) | 30 | Superseded by ScheduleExpression |
| (manual trigger) | | rotate-secret with no rules | Runs all four steps immediately, once |

Put RDS Proxy in front of this. The proxy keeps a warm connection pool and integrates natively with Secrets Manager, so when AWSCURRENT flips it picks up the new credential without each application instance re-reading the secret. Combined with alternating-user rotation, cutover becomes a non-event — covered in depth in RDS Proxy: Connection Pooling, Failover, IAM Auth.

4. VPC, networking, and KMS permissions

This is where rotation breaks in practice, and the failure is always the same flavour: the Lambda times out with no useful error. The function has to reach two endpoints — the database and the Secrets Manager API — and both paths must be open.

If your database is in private subnets, the rotation Lambda must attach to the same VPC with security-group access to the DB port. But a VPC Lambda loses default internet egress and still needs to call Secrets Manager. The two ways to give it API access, side by side:

| Egress option | How it works | Cost | When to choose | Gotcha |
| --- | --- | --- | --- | --- |
| Secrets Manager interface endpoint | com.amazonaws.<region>.secretsmanager ENI in the Lambda subnets | ~hourly/ENI + per-GB | Private workloads (the right answer) | Needs private_dns_enabled and an SG allowing 443 |
| NAT gateway | Route 0.0.0.0/0 → NAT → IGW | Hourly + per-GB egress | You already run NAT for other reasons | Pays egress; traffic leaves the VPC for the public endpoint |
| (none: Lambda not in VPC) | Public Lambda reaches SM directly | None extra | DB is publicly reachable (rare/bad) | Can’t reach a private DB at all |

The two network endpoints the rotation Lambda must reach, plus the two dependencies that ride on the Secrets Manager path, and what blocks each:

| Target | Reached over | Requires | If blocked you see |
| --- | --- | --- | --- |
| RDS/Aurora (DB port) | VPC, in-subnet | SG ingress on 5432/3306 from rotator SG | Timeout at setSecret/testSecret |
| Secrets Manager API | Interface endpoint or NAT | Endpoint + SG 443, or NAT route | Timeout before any step logs |
| KMS (decrypt the secret) | Via the Secrets Manager call (Secrets Manager calls KMS for the caller) | kms:Decrypt on the key for the rotator role | AccessDenied on the Secrets Manager call |
| (alternating) Master secret read | Secrets Manager API | Same as SM API + read on master | setSecret cannot clone the user |

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.sm_endpoint.id]
  private_dns_enabled = true
}

# The rotation Lambda's SG must be allowed inbound on the DB port.
resource "aws_security_group_rule" "db_from_rotator" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds.id
  source_security_group_id = aws_security_group.rotation_lambda.id
}
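Because both failures present as a bare timeout, a quick TCP probe run from inside the rotation Lambda (or any function in the same subnets and security group) tells you which path is closed. This is a diagnostic sketch, not part of the rotation contract: `can_connect` and `check_rotation_paths` are hypothetical helper names, and the regional hostname assumes the standard `secretsmanager.<region>.amazonaws.com` endpoint, which resolves to the interface endpoint when private DNS is enabled.

```python
import socket


def can_connect(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_rotation_paths(db_host, db_port, region):
    """Probe the two network paths a VPC rotation Lambda depends on."""
    sm_host = f"secretsmanager.{region}.amazonaws.com"
    return {
        "database": can_connect(db_host, db_port),
        "secrets_manager_api": can_connect(sm_host, 443),
    }
```

Calling `check_rotation_paths("payments.cluster-abc123.us-east-1.rds.amazonaws.com", 5432, "us-east-1")` from the function returns a dict such as `{"database": True, "secrets_manager_api": False}`, which points straight at a missing endpoint or NAT route. A successful TCP connect proves routing and security groups only; it says nothing about IAM or KMS permissions.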

For KMS: if the secret is encrypted with a customer-managed key (CMK) rather than the AWS-managed aws/secretsmanager key — and at scale it should be, for cross-account and auditability reasons — the rotation Lambda’s execution role needs kms:Decrypt and kms:GenerateDataKey on that key, and the key policy must permit Secrets Manager. The role also needs the standard rotation permissions plus secretsmanager:GetRandomPassword.

The exact IAM actions the rotation execution role needs, and why each is there:

| Action | On resource | Why the rotator needs it |
| --- | --- | --- |
| secretsmanager:DescribeSecret | The secret | Read rotation state and version stages |
| secretsmanager:GetSecretValue | The secret (+ master) | Read AWSCURRENT/AWSPENDING, master to clone |
| secretsmanager:PutSecretValue | The secret | Write the new AWSPENDING version |
| secretsmanager:UpdateSecretVersionStage | The secret | Move AWSCURRENT at finishSecret |
| secretsmanager:GetRandomPassword | * | Generate the new password |
| kms:Decrypt | The CMK | Decrypt the stored secret |
| kms:GenerateDataKey | The CMK | Encrypt the new version |
| ec2:CreateNetworkInterface etc. | * | VPC Lambda ENI management (managed policy) |

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:DescribeSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
        "secretsmanager:UpdateSecretVersionStage"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/rotation": "managed" }
      }
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetRandomPassword",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/<cmk-id>"
    }
  ]
}
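A policy like this drifts as it is copied between projects, so a small lint step can catch a dropped action before a rotation cycle does. The sketch below compares a policy document against the action list from the table above. It is deliberately naive: it looks only at `Allow` statements and their `Action` field, and ignores `Resource`, `Condition`, `NotAction`, and explicit denies. The names `REQUIRED_ACTIONS` and `missing_actions` are illustrative.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = [
    "secretsmanager:DescribeSecret",
    "secretsmanager:GetSecretValue",
    "secretsmanager:PutSecretValue",
    "secretsmanager:UpdateSecretVersionStage",
    "secretsmanager:GetRandomPassword",
    "kms:Decrypt",
    "kms:GenerateDataKey",
]


def allowed_action_patterns(policy):
    """Collect every Action pattern that appears in an Allow statement."""
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        patterns.extend([actions] if isinstance(actions, str) else actions)
    return patterns


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement covers."""
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    # IAM action names are case-insensitive and support * and ? wildcards.
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]
```

Load the policy with `json.load` and call `missing_actions(policy)`; an empty list means every action in the table is at least mentioned. It does not prove the policy is sufficient, since a condition or resource scope can still block the call.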

Secrets Manager invokes your function on your behalf, so the Lambda’s resource policy must grant lambda:InvokeFunction to secretsmanager.amazonaws.com. The AWS console wires this automatically; in IaC you add it explicitly:

resource "aws_lambda_permission" "allow_secretsmanager" {
  statement_id  = "AllowSecretsManagerInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.rotator.function_name
  principal     = "secretsmanager.amazonaws.com"
}

The four policies rotation touches (confusing them is a common time-sink, so keep them straight):

| Policy | Attached to | Direction | Grants |
| --- | --- | --- | --- |
| Execution role (identity) | The Lambda | What the Lambda may do | SM read/write, KMS, ENI |
| Lambda resource policy | The Lambda | Who may invoke it | secretsmanager.amazonaws.com invoke |
| KMS key policy | The CMK | Who may use the key | SM service + rotator + (cross-acct) consumer |
| Secret resource policy | The secret | Who else may read it | Consumer account (cross-account) |

5. A custom rotator for non-RDS credentials

For third-party API keys, a MongoDB Atlas user, or an internal service token, there is no managed function — you implement the four steps against the provider’s API. The shape is identical; only setSecret and testSecret change. The key discipline: many providers do not let you set a key value — they generate and return it. There, createSecret and setSecret collapse — you call the provider’s “create credential” API once, store the result as AWSPENDING, and make setSecret a near no-op. Clean up the old credential only in finishSecret, never before — deleting the old key before the new one is promoted is how you cause an outage.

The two provider archetypes and how the four steps adapt to each:

| Step | Provider accepts a value (DB-like) | Provider issues the value (API-key-like) |
| --- | --- | --- |
| createSecret | Generate value, store AWSPENDING | Call provider “create key”, store returned value as AWSPENDING |
| setSecret | Apply value to the service (ALTER USER) | Near no-op (key already live at provider) |
| testSecret | Authenticate with pending value | Make an authenticated call with pending key |
| finishSecret | Move AWSCURRENT | Move AWSCURRENT, then delete the old provider key |

# provider_key_exists, provider_create_key and provider_authenticated_call are
# placeholders for your provider's client; they are not defined here.
# This variant is for a provider that accepts a caller-supplied key value.
def set_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # Idempotent: only create at the provider if this pending key is not live yet.
    if not provider_key_exists(pending["api_key_id"]):
        provider_create_key(pending["api_key_id"], pending["api_key_secret"])

def test_secret(service_client, arn, token):
    pending = json.loads(
        service_client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )["SecretString"]
    )
    # A real authenticated call, so a broken key fails here and is never promoted.
    resp = provider_authenticated_call(pending["api_key_secret"])
    if resp.status_code != 200:
        raise ValueError("Pending credential failed validation")

Custom-rotator pitfalls that cause real outages, and the discipline that avoids each:

| Pitfall | What happens | Discipline |
| --- | --- | --- |
| Delete old key in setSecret | Old key dies before promotion → outage | Delete only in finishSecret, after cutover |
| Non-idempotent provider create | Retry creates duplicate keys | Check provider_key_exists before create |
| No testSecret call | Broken key gets promoted | Always make a real authenticated call |
| Assuming you can set the key value | Provider rejects/ignores it | Store what the provider returns |
| Hard provider rate limits | Create/delete throttled mid-cycle | Back off; keep the window (Duration) generous |
| Leaking the old key forever | Credential sprawl | finishSecret revokes the prior credential |
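The ordering rule in the first and last rows (promote first, revoke second, and only in finishSecret) is easy to get backwards under pressure, so it helps to isolate it in one small function. A sketch under stated assumptions: `promote` and `revoke_key` are callables you supply (for example, a wrapper around `update_secret_version_stage` and your provider's delete-key call), and both are assumed to be idempotent themselves. None of these names come from an AWS or provider SDK.

```python
def finish_provider_rotation(promote, revoke_key, old_key_id, new_key_id):
    """Promote the new key, then revoke the old one. Never the reverse."""
    # Step 1: move AWSCURRENT to the new version. If this raises, the old
    # key is still live at the provider and nothing is lost.
    promote(new_key_id)

    # Step 2: only after cutover, retire the previous credential.
    # Skip when there is no distinct old key (first rotation, or a retry
    # that runs after the old key was already replaced).
    if old_key_id and old_key_id != new_key_id:
        revoke_key(old_key_id)


# Usage with recording fakes, to show the call order.
calls = []
finish_provider_rotation(
    promote=lambda key_id: calls.append(("promote", key_id)),
    revoke_key=lambda key_id: calls.append(("revoke", key_id)),
    old_key_id="key-1",
    new_key_id="key-2",
)
print(calls)
```

If the revoke call fails after a successful promote, the retry re-enters with the new key already current; the worst case is a leftover old key, which is a cleanup problem rather than an outage.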

Deploy and schedule a custom rotator the same way, pointing --rotation-lambda-arn at your function. Validate one step at a time before enabling the schedule — rotate-secret runs all four immediately, and you do not want to discover a broken finishSecret in production.

6. Cross-account sharing

A central account often owns secrets that spoke-account workloads consume, such as a shared database credential or a partner API key. Three things must all be true or the consumer gets AccessDenied: a resource policy on the secret, KMS permissions on the encrypting key, and, on the consumer side, an identity policy on the principal. Three independent allows, evaluated in three different places.

The three grants cross-account sharing requires, where each lives, and the symptom when it is the one missing:

| Grant | Lives in | Account | Symptom if missing |
| --- | --- | --- | --- |
| Secret resource policy | On the secret | Owner | AccessDenied on GetSecretValue |
| KMS key policy Decrypt | On the CMK | Owner | AccessDenied on kms:Decrypt (can’t decrypt) |
| Identity policy | On the principal | Consumer | AccessDenied even though resource policy allows |
| CMK (not aws/sm) | KMS choice | Owner | Managed key can’t be shared at all |

The resource policy grants the consumer account access to the secret:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "*"
    }
  ]
}

The trap: you cannot use the AWS-managed aws/secretsmanager key for cross-account access — its key policy is not editable to permit another account. Encrypt cross-account secrets with a customer-managed CMK and grant the consumer account kms:Decrypt in the key policy:

{
  "Sid": "AllowConsumerDecrypt",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
    }
  }
}
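The two owner-side documents can also be generated rather than hand-edited. The following is a sketch, not a definitive implementation: `secret_resource_policy`, `kms_decrypt_statement`, and `attach_resource_policy` are hypothetical helper names, and the consumer role ARN is an assumed example. The only AWS call used is boto3's `put_resource_policy`, which is not invoked by the demo line.

```python
import json

def secret_resource_policy(consumer_principal_arn):
    """Resource policy letting one consumer principal read the secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": consumer_principal_arn},
            "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
            "Resource": "*",
        }],
    }

def kms_decrypt_statement(consumer_principal_arn, region):
    """Key-policy statement allowing decrypt only via Secrets Manager in one region."""
    return {
        "Sid": "AllowConsumerDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": consumer_principal_arn},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEquals": {"kms:ViaService": f"secretsmanager.{region}.amazonaws.com"}
        },
    }

def attach_resource_policy(secret_id, policy):
    """Attach the policy to the secret (run with owner-account credentials)."""
    import boto3  # imported lazily so the builders above work without boto3 installed
    boto3.client("secretsmanager").put_resource_policy(
        SecretId=secret_id,
        ResourcePolicy=json.dumps(policy),
        BlockPublicPolicy=True,  # ask Secrets Manager to reject an overly broad policy
    )

# Assumed example principal: a specific consumer role instead of the account root.
consumer = "arn:aws:iam::444455556666:role/app-role"
print(json.dumps(kms_decrypt_statement(consumer, "us-east-1"), indent=2))
```

Building both documents from one principal argument keeps the secret policy and the key policy from drifting apart when the consumer changes.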

Why aws/secretsmanager cannot go cross-account, contrasted with a CMK — this single distinction is the most common cross-account failure:

| Property | aws/secretsmanager (managed) | Customer-managed CMK |
| --- | --- | --- |
| Key policy editable | No | Yes |
| Cross-account grant possible | No | Yes |
| Cost | Free | ~$1/key/month + API |
| Rotation of the key itself | AWS-managed | You choose (annual+) |
| Auditability granularity | Coarse | Per-key CloudTrail |
| Use for shared secrets | Never | Always |

On the consumer side, the IAM principal still needs its own identity-based policy allowing secretsmanager:GetSecretValue on the secret ARN and kms:Decrypt on the key ARN; resource policy and identity policy must both allow, since each account authorizes independently. Run as below, the policy simulator evaluates only the role's identity-based policies, not the owner's resource policy or key policy, so an allowed result confirms the consumer half alone. The end-to-end proof is a real GetSecretValue call from the consumer role. The consumer-side check:

# From the CONSUMER account: does this role's identity policy allow the read?
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::444455556666:role/app-role \
  --action-names secretsmanager:GetSecretValue \
  --resource-arns arn:aws:secretsmanager:us-east-1:111122223333:secret:shared/partner-api-XyZ \
  --query 'EvaluationResults[].{action:EvalActionName, decision:EvalDecision}'
# Repeat for kms:Decrypt with --resource-arns set to the CMK's ARN, not the secret's.

Rotation keeps running in the owning account; consumers always resolve AWSCURRENT and so follow rotation automatically. The deeper cross-account mechanics (external ID, confused-deputy, session policies) are in IAM Cross-Account Roles, External ID, Confused Deputy; the KMS half is in AWS KMS Encryption Deep Dive.

7. Application integration without rotation-window failures

The cache makes or breaks the consumer experience. Calling GetSecretValue on every request is slow and hits API throttling. The right pattern is the AWS caching client (Java, Python, Go, .NET, or the Lambda extension), which caches AWSCURRENT in memory and refreshes on an interval.

import json

import boto3
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

client = boto3.client("secretsmanager")
# Cache AWSCURRENT in memory and re-read it at most once per hour.
cache = SecretCache(config=SecretCacheConfig(secret_refresh_interval=3600), client=client)

def get_connection():
    secret = json.loads(cache.get_secret_string("prod/payments/db-app"))
    # connect() stands for your database driver's connect call.
    return connect(secret)  # On auth failure, force-refresh and retry once.

The integration approaches ranked from worst to best, with the trade-off each makes:

| Pattern | Latency | API cost | Rotation-safe? | Verdict |
| --- | --- | --- | --- | --- |
| GetSecretValue per request | High | High (throttles) | Yes (always fresh) | Never do this |
| Read once at boot, never refresh | Low | Minimal | No — misses rotation | Breaks on every cycle |
| Caching client (TTL) | Low | Low | Yes (refreshes) | Recommended for services |
| Caching client + refresh-on-auth-fail | Low | Low | Yes + self-heals | Best for single-user |
| Lambda extension (localhost cache) | Lowest | Low | Yes | Best for serverless |
| RDS Proxy (proxy holds the secret) | Lowest | None (app) | Yes — cutover at proxy | Best for shared DB creds |

Two rules keep you safe across rotations:

  1. Refresh on auth failure, not only on a timer. If a DB connection is rejected, invalidate the cache entry, re-read AWSCURRENT, and retry once. With alternating-user rotation this almost never fires, but it is your safety net for single-user.
  2. Never pin a VersionId. Read AWSCURRENT (the default). Pinning a version means you never follow rotation — the cardinal sin.
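The first rule can be sketched without any AWS dependency. This is an illustration under stated assumptions, not the caching client's own API: `TtlSecret`, `AuthError`, and `connect_with_retry` are hypothetical names, and `fetch` stands for whatever reads AWSCURRENT in your service.

```python
import time

class AuthError(Exception):
    """Stand-in for your database driver's authentication-failure exception."""

class TtlSecret:
    """Caches one secret value and re-fetches it after ttl_seconds."""
    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl_seconds, clock
        self._value, self._expires = None, 0.0

    def get(self):
        if self._value is None or self._clock() >= self._expires:
            self.refresh()
        return self._value

    def refresh(self):
        """Force a re-read of the current value, ignoring the TTL."""
        self._value = self._fetch()
        self._expires = self._clock() + self._ttl
        return self._value

def connect_with_retry(secret, connect):
    """Try the cached credential; on auth failure, re-read and retry exactly once."""
    try:
        return connect(secret.get())
    except AuthError:
        return connect(secret.refresh())

# Simulated rotation: the cache holds "old" when the real password becomes "new".
current = {"password": "old"}
secret = TtlSecret(fetch=lambda: dict(current))
secret.get()
current["password"] = "new"

def fake_connect(cred):
    if cred["password"] != current["password"]:
        raise AuthError("password authentication failed")
    return f"connected with {cred['password']}"

result = connect_with_retry(secret, fake_connect)
print(result)  # connected with new
```

Retrying only once matters: if the fresh credential also fails, the problem is not a stale cache, and looping would hammer both the database and the Secrets Manager API.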

The caching-client knobs and how to reason about each:

| Setting | What it does | Default | When to change |
| --- | --- | --- | --- |
| secret_refresh_interval | TTL before re-reading AWSCURRENT | 3600 s | Lower for faster pickup; higher to cut API calls |
| max_cache_size | How many secrets cached | 1024 | Raise if a process reads many secrets |
| exception_retry_delay_base | Backoff on API errors | 1 s | Tune under throttling |
| Force-refresh on auth fail | Re-read immediately on rejection | your code | Always implement for single-user |
| Version stage requested | Which label to read | AWSCURRENT | Leave default — never pin a version |

The Lambda extension is cleanest for serverless consumers: it runs a local HTTP cache with built-in TTL, so your function makes a localhost call instead of hitting the Secrets Manager API.
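A minimal sketch of that localhost call, assuming the AWS Parameters and Secrets Lambda Extension is attached as a layer and listening on its default port 2773. `build_extension_request` and `read_secret` are hypothetical helper names; the demo line only builds the request and does not contact the extension.

```python
import json
import os
import urllib.parse
import urllib.request

def build_extension_request(secret_id, port=2773):
    """Build the localhost GET the extension expects, authenticated with the session token."""
    query = urllib.parse.urlencode({"secretId": secret_id})
    url = f"http://localhost:{port}/secretsmanager/get?{query}"
    headers = {"X-Aws-Parameters-Secrets-Token": os.environ.get("AWS_SESSION_TOKEN", "")}
    return urllib.request.Request(url, headers=headers)

def read_secret(secret_id):
    """Fetch and decode a JSON secret through the extension (only works inside Lambda)."""
    with urllib.request.urlopen(build_extension_request(secret_id), timeout=2) as resp:
        payload = json.loads(resp.read())
    return json.loads(payload["SecretString"])

print(build_extension_request("prod/payments/db-app").full_url)
```

The response body mirrors a GetSecretValue response, which is why `read_secret` decodes JSON twice: once for the envelope and once for the `SecretString` inside it.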

8. Monitoring rotation health and staleness

Rotation that fails silently is worse than no rotation, because you believe you are covered.

The signals to wire, what each tells you, and how to surface it:

| Signal | Source | Tells you | Surface via |
| --- | --- | --- | --- |
| RotationFailed | SM event (CloudTrail) | A cycle broke right now | EventBridge → SNS/PagerDuty |
| RotationSucceeded | SM event (CloudTrail) | A cycle completed | Metric; alarm on absence |
| LastRotatedDate age | describe-secret | Secret is overdue | Scheduled Lambda / Config |
| Lingering AWSPENDING | VersionIdsToStages | A step failed mid-cycle | Scheduled check / dashboard |
| secretsmanager-rotation-enabled-check | AWS Config rule | Rotation not even enabled | Config dashboard / SNS |
| secretsmanager-secret-periodic-rotation | AWS Config rule | Not rotated within max age | Config dashboard / SNS |

The EventBridge event pattern that matches a failed rotation:

{
  "source": ["aws.secretsmanager"],
  "detail-type": ["AWS Service Event via CloudTrail"],
  "detail": {
    "eventName": ["RotationFailed"]
  }
}

For staleness, AWS Config is the durable approach. Its managed rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation flag secrets without rotation enabled or not rotated within a maximum age — far better than a custom script you will forget to maintain. Pair this with the CloudTrail/CloudWatch foundations in CloudWatch & CloudTrail Observability Deep Dive.
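Where a scheduled check is still wanted alongside Config, the staleness and lingering-label signals reduce to a pure function over a `describe_secret` response. The sketch below assumes that response shape (`RotationEnabled`, `LastRotatedDate`, `VersionIdsToStages`); `rotation_findings` and the finding strings are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

def rotation_findings(description, max_age_days, now=None):
    """Return a list of health findings for one describe_secret response."""
    now = now or datetime.now(timezone.utc)
    findings = []
    if not description.get("RotationEnabled", False):
        findings.append("rotation-disabled")
    last_rotated = description.get("LastRotatedDate")
    if last_rotated is None or now - last_rotated > timedelta(days=max_age_days):
        findings.append("overdue")
    # AWSPENDING on a version that is not also AWSCURRENT means a cycle stopped midway.
    for labels in description.get("VersionIdsToStages", {}).values():
        if "AWSPENDING" in labels and "AWSCURRENT" not in labels:
            findings.append("lingering-AWSPENDING")
            break
    return findings

# Fabricated example input, shaped like a describe_secret response.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stuck = {
    "RotationEnabled": True,
    "LastRotatedDate": datetime(2024, 4, 1, tzinfo=timezone.utc),
    "VersionIdsToStages": {"v1": ["AWSCURRENT"], "v2": ["AWSPENDING"]},
}
print(rotation_findings(stuck, max_age_days=30, now=now))  # ['overdue', 'lingering-AWSPENDING']
```

The check deliberately ignores AWSPENDING when it sits on the same version as AWSCURRENT, since that can be left behind by a cycle that completed normally.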

Architecture at a glance

The diagram traces a single rotation across two accounts, left to right. On the far left, the consumer account (444455556666) runs the app — or, better, RDS Proxy — which only ever reads AWSCURRENT through the caching client, and a consumer IAM role that must carry its own GetSecretValue + kms:Decrypt grant. Next is the owner account (111122223333), which holds the secret (with its AWSCURRENT/AWSPENDING versions), the resource policy that grants the consumer, and the customer-managed CMK — explicitly not aws/secretsmanager, because the managed key cannot be shared. The third zone is the rotation tier living inside the database VPC: the four-step rotation Lambda and the Secrets Manager interface VPC endpoint that lets that VPC-bound Lambda reach the SM API at all. The right-most zone is the data tier in private subnets: the Aurora PostgreSQL cluster on 5432 with its two alternating users, and the elevated master secret whose superuser clones the application user.

Follow the flows: the consumer calls GetSecretValue and receives whatever AWSCURRENT points at; Secrets Manager invokes the rotation Lambda through its four steps; the Lambda reaches the database on 5432 to ALTER USER, then writes the new version back with PutSecretValue. The five numbered badges sit on exactly the hops that stall in practice — the Lambda losing its route to the DB (1), the missing SM endpoint (2), the master secret needed to clone (3), the un-shareable managed key (4), and the consumer’s missing identity grant (5) — and the legend narrates each as symptom, how to confirm, and the fix. Read the badges as the failure map laid directly over the architecture.

Figure: Cross-account AWS Secrets Manager rotation architecture, showing the consumer account, the owner account with the secret, its resource policy and customer-managed CMK, the rotation tier inside the database VPC, and the Aurora PostgreSQL data tier, with numbered badges marking the five points where rotation stalls and a legend giving symptom, confirm, and fix for each.

Real-world scenario

A payments platform team — call them NorthPay — ran 40+ microservices against an Aurora PostgreSQL cluster, all sharing one application credential rotated single-user every 7 days. Every rotation produced a 20-to-90 second spike of 500s as services hit auth failures and slowly refreshed their caches. The on-call playbook literally said “ignore the Thursday 02:00 error spike” — which is exactly how a real incident later got ignored for eleven minutes, because operators had been trained to wave off the Thursday alarm.

The constraints were hard: they could not coordinate a synchronized restart of 40 services, and the payments hot path’s error budget (a 99.95% SLO, ~21 minutes/month) was being burned by the rotation spike alone — four cycles a month at up to 90 seconds each was a meaningful fraction of the entire budget, spent on a self-inflicted event. The fix had two moves. First, they switched the secret to alternating-user rotation with a master secret, so the promoted credential always belonged to a user that already existed with a valid password — eliminating the invalid-credential window. Second, they put RDS Proxy in front of the cluster so cutover happened once, at the proxy, instead of independently in 40 connection pools.

aws rds create-db-proxy \
  --db-proxy-name payments-proxy \
  --engine-family POSTGRESQL \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-app-XyZ","IAMAuth":"DISABLED"}]' \
  --role-arn arn:aws:iam::111122223333:role/payments-proxy-role \
  --vpc-subnet-ids subnet-aaa subnet-bbb

There was a second-order problem the team only found in staging: the rotation Lambda had originally been deployed outside the VPC (it had worked because the dev database was publicly reachable), and moving it into the production VPC immediately broke it with a bare timeout. The cause was the classic missing Secrets Manager interface endpoint — the VPC Lambda could now reach the database but had lost its path to the SM API. Adding com.amazonaws.us-east-1.secretsmanager with private DNS in the Lambda’s subnets fixed it in one change. They also moved the secret off aws/secretsmanager onto a CMK in the same motion, because a sibling analytics account needed read access and the managed key could never grant it.

The before/after, measured over a quarter:

| Metric | Before (single-user, no proxy) | After (alternating-user + proxy) |
| --- | --- | --- |
| Rotation-window 5xx | 20–90 s spike every cycle | Zero across the quarter |
| Error budget spent on rotation | ~30% of monthly SLO | ~0% |
| Services needing a restart on rotation | Up to 40 (uncoordinated) | 0 (cutover at proxy) |
| Cross-account analytics read | Impossible (aws/sm key) | Working (CMK grant) |
| “Ignore the Thursday spike” runbook line | Present | Deleted |

After the change, rotation-window errors went to zero, and the “ignore the Thursday spike” line was deleted from the runbook — the real win, because a runbook that trains operators to ignore alerts is a latent incident.
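The error-budget figures in this scenario can be checked with quick arithmetic. The sketch assumes a 30-day month and the worst case of four cycles at 90 seconds each; `monthly_error_budget_minutes` is a hypothetical helper name.

```python
def monthly_error_budget_minutes(slo, days=30):
    """Minutes of allowed unavailability per month for a given availability SLO."""
    return (1 - slo) * days * 24 * 60

budget = monthly_error_budget_minutes(0.9995)  # about 21.6 minutes
worst_case = 4 * 90 / 60                       # four 90-second spikes = 6.0 minutes
share = worst_case / budget                    # fraction of the budget spent on rotation

print(f"budget={budget:.1f} min, rotation={worst_case:.1f} min, share={share:.0%}")
```

Six minutes against a budget of roughly 21.6 minutes is about 28%, which is where the "~30% of monthly SLO" row comes from.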

Advantages and disadvantages

Automatic rotation is the right default for any credential that can rotate, but it is not free of cost or risk. The honest trade-off:

| Advantages | Disadvantages |
| --- | --- |
| Short-lived credentials shrink blast radius | Adds a moving part (the Lambda) that can fail |
| Zero-downtime cutover (alternating-user) | Networking/KMS wiring is unforgiving when wrong |
| Consumers follow AWSCURRENT automatically | Silent failure is possible without monitoring |
| Managed functions cover most databases | Custom targets need real engineering |
| Native audit via CloudTrail | Per-secret + per-API + CMK cost at scale |
| Cross-account sharing without copying secrets | Three-place authorization is easy to get wrong |
| Compliance: provable rotation cadence | Stuck AWSPENDING needs operational hygiene |

When each side dominates: for production database credentials on a hot path, the advantages are decisive — alternating-user + RDS Proxy turns the biggest risk (the cutover window) into a non-event, and the cost is rounding error against the cost of a leaked standing password. For a rarely-used, low-value, easily-revoked credential, the disadvantages can outweigh the benefit — a rotation Lambda you must maintain for a secret nobody attacks is overhead; a long manual-rotation interval may be enough. For third-party credentials with hard rate limits or fragile APIs, weigh the custom-rotator engineering against simply rotating manually on a calendar.

Hands-on lab

This lab stands up single-user rotation on an RDS PostgreSQL secret end to end, triggers a manual cycle, verifies the staging labels moved, and tears everything down. It assumes an existing RDS PostgreSQL instance reachable from where you run the rotation function; for a fully free-tier path, use a db.t3.micro (free tier eligible for 12 months) in a public subnet so you can skip the VPC endpoint for the lab (never do this in production).

The lab at a glance:

| Step | Action | Command | Expected result |
| --- | --- | --- | --- |
| 1 | Create the secret with engine metadata | create-secret | Secret ARN returned |
| 2 | Deploy the managed rotation function | SAR / console | Lambda ARN returned |
| 3 | Grant SM invoke on the Lambda | add-permission | Statement added |
| 4 | Enable rotation | rotate-secret --rotation-lambda-arn | VersionId for pending |
| 5 | Trigger a manual cycle | rotate-secret | All four steps run |
| 6 | Verify labels moved | describe-secret | New AWSCURRENT, no AWSPENDING |
| 7 | Confirm it authenticates | get-secret-value + psql | Login succeeds |
| 8 | Tear down | delete-secret, delete Lambda | Resources gone |

Step 1 — create the secret with the engine metadata the managed function needs:

aws secretsmanager create-secret \
  --name lab/rotation/pg-app \
  --secret-string '{
    "engine":"postgres",
    "host":"lab-db.abc123.us-east-1.rds.amazonaws.com",
    "username":"app_user",
    "password":"InitialPassw0rd!",
    "dbname":"appdb",
    "port":5432
  }'

Step 2 — deploy the managed Postgres single-user rotation function from the Serverless Application Repository (the console “Edit rotation” wizard does this for you; the CLI path uses aws serverlessrepo create-cloud-formation-change-set). For the lab, the console wizard is fastest: open the secret → Rotation → Edit rotation → Create a new Lambda function → choose the single-user template.

Step 3 — grant Secrets Manager permission to invoke the function (the console does this; shown here for completeness):

aws lambda add-permission \
  --function-name SecretsManagerRotationLabFn \
  --statement-id AllowSM \
  --action lambda:InvokeFunction \
  --principal secretsmanager.amazonaws.com

Step 4 — enable rotation pointing at the function, on a 30-day schedule:

aws secretsmanager rotate-secret \
  --secret-id lab/rotation/pg-app \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRotationLabFn \
  --rotation-rules '{"ScheduleExpression":"rate(30 days)"}'

Step 5 — trigger a manual cycle now (runs all four steps immediately):

aws secretsmanager rotate-secret --secret-id lab/rotation/pg-app

Step 6 — verify the labels moved. This is the moment of truth:

aws secretsmanager describe-secret --secret-id lab/rotation/pg-app \
  --query '{Rotated:LastRotatedDate, Versions:VersionIdsToStages}'
# Expect: LastRotatedDate just now; one version AWSCURRENT, one AWSPREVIOUS, and no AWSPENDING left on a separate version.

Step 7 — confirm the live credential authenticates:

CRED=$(aws secretsmanager get-secret-value --secret-id lab/rotation/pg-app \
  --query SecretString --output text)
PGPASSWORD=$(echo "$CRED" | jq -r .password) \
  psql -h $(echo "$CRED" | jq -r .host) -U app_user -d appdb -c '\conninfo'
# "You are connected to database appdb" → rotation produced a working credential.

Step 8 — tear down so you stop paying for the secret and the instance:

aws secretsmanager delete-secret --secret-id lab/rotation/pg-app \
  --force-delete-without-recovery
aws lambda delete-function --function-name SecretsManagerRotationLabFn
# Delete the RDS instance from the console or:
aws rds delete-db-instance --db-instance-identifier lab-db \
  --skip-final-snapshot --delete-automated-backups

What each teardown action stops costing you:

| Resource | Lingering cost if left | Teardown |
| --- | --- | --- |
| Secret | ~$0.40/month + API calls | delete-secret --force-delete-without-recovery |
| Rotation Lambda | Per-invocation (negligible idle) | delete-function |
| RDS instance | Hourly + storage | delete-db-instance --skip-final-snapshot |
| CMK (if created) | ~$1/month | Schedule key deletion (7–30 day window) |
| VPC endpoint (if created) | Hourly per ENI | delete-vpc-endpoints |

Common mistakes & troubleshooting

This is the section you keep open at 02:00. Each row is a real failure mode: the symptom, the root cause, the exact command or console path to confirm it, and the fix.

| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Lambda times out, secret stuck AWSPENDING | No route to the DB or the SM API | CloudWatch Logs: function ends mid-step; describe-secret shows lingering AWSPENDING | Put rotator in DB VPC; add SM interface endpoint; open SG on DB port |
| 2 | Timeout before any step logs | VPC Lambda has no Secrets Manager endpoint | Logs show nothing; ENI exists but no 443 path | Add com.amazonaws.<region>.secretsmanager interface endpoint + private DNS (or NAT) |
| 3 | setSecret fails: permission denied | Alternating-user without masterarn / weak master | Logs: permission denied to create/alter user; secret JSON lacks masterarn | Add masterarn; master secret holds a superuser that can clone |
| 4 | Every step re-runs and desyncs | Steps are not idempotent | Logs show password regenerated each createSecret retry | Guard createSecret on existing AWSPENDING; guard finishSecret on already-current |
| 5 | Consumer gets AccessDenied on GetSecretValue | Missing/incorrect secret resource policy | get-resource-policy; simulate-principal-policy | Add resource policy granting the consumer account |
| 6 | Consumer gets AccessDenied on kms:Decrypt | Secret on aws/secretsmanager (uneditable) | describe-secret shows KmsKeyId: alias/aws/secretsmanager | Re-encrypt with a CMK; grant consumer kms:Decrypt via kms:ViaService |
| 7 | Resource policy allows but consumer still denied | Consumer identity policy missing the grant | simulate-principal-policy from the consumer role | Add GetSecretValue + kms:Decrypt to the consumer principal |
| 8 | App auth fails right after a clean rotation | App pinned to a VersionId | App config reads a fixed version, not AWSCURRENT | Read AWSCURRENT (default); never pin a version |
| 9 | 5xx spike every rotation cycle | Single-user on a hot path | App logs show auth failures clustered at rotation time | Switch to alternating-user; front with RDS Proxy |
| 10 | Secret believed rotating but never does | Rotation disabled or silently failing | describe-secret: RotationEnabled:false or stale LastRotatedDate | Enable rotation; add Config rule + RotationFailed alarm |
| 11 | rotate-secret returns but nothing changes | SM can’t invoke the Lambda | Lambda resource policy lacks secretsmanager.amazonaws.com | Add lambda:InvokeFunction permission for the SM service principal |
| 12 | finishSecret runs forever / loops | current_version never updated, retry logic wrong | Logs: UpdateSecretVersionStage repeats | Return early when current_version == token; move label once |
| 13 | Rotation works in dev, times out in prod | Dev DB public; prod DB private, no endpoint | Same Lambda, only network differs | Add VPC config + SM endpoint + SG ingress in prod |
| 14 | KMS AccessDenied for the rotator itself | Role missing Decrypt/GenerateDataKey on the CMK | CloudTrail KMS event AccessDenied | Grant the execution role both KMS actions on the CMK |

The decision table for the most common ambiguous signal — “rotation isn’t working”:

| If you see… | It’s probably… | Do this |
| --- | --- | --- |
| Lingering AWSPENDING, logs stop at setSecret/testSecret | No DB route or DB-side permission | Check SG ingress + DB grants |
| Timeout with zero step logs | No SM API egress (missing endpoint/NAT) | Add SM interface endpoint |
| RotationEnabled:false | Rotation never enabled | rotate-secret --rotation-lambda-arn … |
| LastRotatedDate old, RotationEnabled:true | Silent recurring failure | Check RotationFailed events / logs |
| Consumer denied, owner can read fine | Cross-account grant incomplete | Verify all three: resource + KMS + identity |
| App fails only at rotation time | Cache/version-pin problem | Read AWSCURRENT; refresh on auth fail |

Best practices

Security notes

Rotation is a security control, so its own posture must be tight. The least-privilege and isolation rules that matter:

| Control | What to do | Why |
| --- | --- | --- |
| Rotator role scope | Limit to specific secret ARNs / tags, not * | A compromised rotator shouldn’t read every secret |
| KMS key separation | Per-domain CMKs, not one key for everything | Blast-radius isolation; per-key audit |
| Cross-account least privilege | Grant GetSecretValue to roles, not :root, where possible | Narrow the principal that can read |
| kms:ViaService condition | Restrict KMS use to Secrets Manager | Key can’t be used out-of-band by the grantee |
| Network isolation | Private subnets + interface endpoints | DB and SM traffic never traverses the internet |
| No secrets in logs | Never log secret values in the rotator | CloudWatch Logs is not a secret store |
| CloudTrail on KMS + SM | Log every Decrypt and GetSecretValue | Detect anomalous reads |
| Resource policy review | Audit who each secret is shared with | Prevent accidental over-sharing |
| Master secret protection | Tightest controls; superuser power | It can clone/alter any DB user |
| Deletion recovery window | Keep a 7–30 day recovery window in prod (default 30); avoid force-delete | Guard against accidental/malicious deletion |

The encryption and identity story end to end: the secret is envelope-encrypted under a CMK; the rotator decrypts with kms:Decrypt scoped to that key; consumers decrypt only through secretsmanager.<region>.amazonaws.com via the kms:ViaService condition; and every read is a CloudTrail event you can alarm on. For the deeper KMS model see AWS KMS Encryption Deep Dive and Multi-Region KMS Keys & Envelope Encryption; for the IAM evaluation order behind the three-place authorization, IAM Fundamentals: Users, Roles, Policies, Evaluation and IAM Least Privilege & Permission Boundaries.
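As an illustration of the rotator-role scoping rule, the execution role's policy can be generated for exactly one secret and one key. This is a sketch under assumptions: `rotator_policy` is a hypothetical helper, both ARNs in the demo are made-up examples, and the VPC network-interface permissions a VPC-attached Lambda also needs are left out.

```python
import json

def rotator_policy(secret_arn, key_arn):
    """Least-privilege identity policy for a rotation Lambda bound to one secret and one CMK."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RotateThisSecretOnly",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:PutSecretValue",
                    "secretsmanager:UpdateSecretVersionStage",
                ],
                "Resource": secret_arn,
            },
            {
                # GetRandomPassword does not act on a specific secret, so it is granted on *.
                "Sid": "GenerateNewPassword",
                "Effect": "Allow",
                "Action": "secretsmanager:GetRandomPassword",
                "Resource": "*",
            },
            {
                "Sid": "UseTheSecretsKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": key_arn,
            },
        ],
    }

# Made-up example ARNs for illustration only.
policy = rotator_policy(
    "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/payments/db-app-XyZ",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```

For alternating-user rotation the role would also need `secretsmanager:GetSecretValue` on the master secret; adding that as its own statement keeps the extra reach visible in review.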

Cost & sizing

Secrets Manager pricing is simple but adds up at scale; rotation introduces Lambda and (usually) CMK and endpoint costs on top. The drivers:

| Cost driver | Rate (us-east-1, approx) | Scales with | Notes |
| --- | --- | --- | --- |
| Secret storage | ~$0.40 / secret / month | Number of secrets | The base line item |
| API calls | ~$0.05 / 10,000 calls | Read volume | The caching client slashes this |
| Rotation Lambda | Per-invocation + duration | Rotations × 4 steps | Tiny; 4 short invocations per cycle |
| Customer-managed CMK | ~$1 / key / month + API | Number of CMKs | Use shared CMKs per domain to bound it |
| SM interface endpoint | ~$0.01/hr per ENI + per-GB | AZs × endpoints | One per AZ in the Lambda subnets |
| NAT (if used instead) | Hourly + per-GB egress | Egress volume | Often costlier than an endpoint |

Right-sizing guidance and rough monthly figures (INR at ~₹84/USD, indicative):

| Scenario | Secrets | CMKs | Endpoints | Rough USD/mo | Rough INR/mo |
| --- | --- | --- | --- | --- | --- |
| Single app, one DB secret | 1 | 0 (managed key) | 0 (public dev) | ~$0.40 | ~₹35 |
| Small prod, private DB | 3 | 1 | 1 (1 AZ) | ~$8–10 | ~₹700–850 |
| Multi-account platform | 30 | 3 | 6 (2 AZ ×3) | ~$60–80 | ~₹5,000–6,700 |
| High-read fleet (no cache) | 10 | 1 | 2 | ~$20 + heavy API | ~₹1,700 + API |
| Same fleet, caching client | 10 | 1 | 2 | ~$20 (API negligible) | ~₹1,700 |
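The fixed part of these figures follows directly from the approximate rates listed above. The sketch below is a rough estimator under those rates, assuming about 730 hours in a month and ignoring Lambda and per-GB charges; `monthly_cost_usd` and the 100-million-call read volume are illustrative assumptions.

```python
SECRET_USD = 0.40          # per secret per month
CMK_USD = 1.00             # per customer-managed key per month
ENDPOINT_ENI_HOUR = 0.01   # per interface-endpoint ENI per hour
API_USD_PER_10K = 0.05     # per 10,000 API calls
HOURS_PER_MONTH = 730

def monthly_cost_usd(secrets, cmks, endpoint_enis, api_calls=0):
    """Approximate monthly Secrets Manager bill from the four main drivers."""
    return (
        secrets * SECRET_USD
        + cmks * CMK_USD
        + endpoint_enis * ENDPOINT_ENI_HOUR * HOURS_PER_MONTH
        + api_calls / 10_000 * API_USD_PER_10K
    )

small_prod = monthly_cost_usd(secrets=3, cmks=1, endpoint_enis=1)
fleet_fixed = monthly_cost_usd(secrets=10, cmks=1, endpoint_enis=2)
fleet_uncached = monthly_cost_usd(secrets=10, cmks=1, endpoint_enis=2, api_calls=100_000_000)

print(f"small prod  ~${small_prod:.2f}")
print(f"fleet fixed ~${fleet_fixed:.2f}")
print(f"fleet with 100M uncached reads ~${fleet_uncached:.2f}")
```

The fixed cost is the same with or without a cache; what the caching client removes is the API term, which dominates once reads run into the hundreds of millions per month.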

The free-tier and cost-control levers worth knowing:

| Lever | Effect | Caveat |
| --- | --- | --- |
| 30-day free trial per secret | First month free | Per secret, one-time |
| Caching client / Lambda extension | Cuts API calls ~90%+ | Implement refresh-on-auth-fail |
| Shared CMK per domain | Fewer $1/month keys | Coarser blast radius |
| Interface endpoint over NAT | No per-GB egress on SM traffic | Pay per-ENI-hour instead |
| RDS Proxy holds the secret | App makes zero SM calls | Proxy has its own hourly cost |
| Consolidate related values | One secret, fewer line items | Don’t bundle unrelated trust domains |

The cost of not rotating — a leaked standing credential leading to data exfiltration — dwarfs all of the above. Treat rotation cost as insurance premium, not overhead.

Interview & exam questions

Q1. Walk me through the four steps of a rotation Lambda. createSecret generates the new value and stores it as AWSPENDING; setSecret applies it to the backing service; testSecret authenticates with the pending value; finishSecret moves the AWSCURRENT label onto the pending version. Each step must be idempotent because Secrets Manager retries.

Q2. Why is alternating-user rotation safer than single-user for a hot path? Single-user changes the password on one user, creating a window where the old cached credential is invalid until the app refreshes. Alternating-user swaps between two users, so the newly promoted credential belongs to a user that already existed with a known-good password — there is no invalid window.

Q3. A rotation Lambda for a private RDS instance times out with no useful error. What’s the most likely cause? It’s in the VPC and can reach the database but has no path to the Secrets Manager API — it’s missing a Secrets Manager interface VPC endpoint (or a NAT route). The classic tell is a timeout before any step logs.

Q4. Why can’t you use the aws/secretsmanager key for a cross-account secret? Its key policy is AWS-managed and not editable, so you cannot grant another account kms:Decrypt. Cross-account secrets must use a customer-managed CMK whose key policy you can edit to permit the consumer account.

Q5. A consumer account has a resource policy granting it the secret but still gets AccessDenied. Why? Authorization is evaluated independently in each account. The consumer’s own identity-based policy must also allow secretsmanager:GetSecretValue (and kms:Decrypt). Resource policy and identity policy must both allow.

Q6. What is masterarn and when is it required? It points to an elevated master/superuser secret. Alternating-user rotation needs it because cloning a user and granting its privileges requires rights the application user does not have; the Lambda authenticates with the master secret to provision the clone.

Q7. Why must rotation steps be idempotent, and where does it bite hardest? Secrets Manager retries on failure. If createSecret regenerates a password on every retry, the stored secret desyncs from what setSecret applied. Guard createSecret on an existing AWSPENDING and finishSecret on an already-current version.

Q8. How do you keep applications from failing during the rotation window? Read AWSCURRENT via the caching client (never pin a VersionId), refresh the cache on auth failure, prefer alternating-user rotation, and front shared DB credentials with RDS Proxy so cutover happens once at the proxy.

Q9. How do you detect a secret that silently stopped rotating? Alarm on the absence of RotationSucceeded, not just on RotationFailed. Use AWS Config rules secretsmanager-rotation-enabled-check and secretsmanager-secret-periodic-rotation, and check LastRotatedDate against the schedule.

Q10. In a custom rotator where the provider issues the key, when do you delete the old credential? Only in finishSecret, after AWSCURRENT has moved. Deleting the old key in createSecret or setSecret kills it before promotion and causes an outage.

Q11. What does a lingering AWSPENDING tell you, and how do you recover? A step failed mid-cycle, usually because setSecret/testSecret couldn’t reach the DB or a permission was missing. Confirm with describe-secret’s VersionIdsToStages, fix the underlying cause (network/permission), then re-trigger rotate-secret.

Q12. Which IAM permissions does the rotation execution role need beyond the obvious secret actions? secretsmanager:GetRandomPassword (on *), kms:Decrypt and kms:GenerateDataKey on the CMK, plus the VPC ENI permissions (via the AWS managed Lambda-VPC policy) when it runs in a VPC.

These map to AWS Certified Security – Specialty (data protection, key management, incident response) and AWS Certified Solutions Architect – Professional (secure multi-account design). The cross-account KMS and resource-policy material is a recurring Security-Specialty theme.

Quick check

  1. Which staging label do applications get by default from GetSecretValue, and which one only exists during a rotation?
  2. You enable rotation on a private RDS secret and the Lambda times out before any step logs. What’s missing?
  3. A spoke account has a secret resource policy granting it access but still gets AccessDenied on GetSecretValue. Name the two things that must also be true.
  4. Why does alternating-user rotation eliminate the rotation-window auth errors that single-user can cause?
  5. In a custom rotator for a provider that issues the key, at which step do you delete the old credential, and why not earlier?

Answers

  1. Applications get AWSCURRENT by default; AWSPENDING exists only during a rotation cycle (and a lingering one means a step failed).
  2. A Secrets Manager interface VPC endpoint (or a NAT route) in the Lambda’s subnets — the VPC Lambda can reach the DB but has no path to the SM API.
  3. The encrypting key must be a customer-managed CMK granting the consumer kms:Decrypt (not aws/secretsmanager), and the consumer’s own identity-based policy must allow GetSecretValue + kms:Decrypt.
  4. Because the promoted credential belongs to a second user that already existed with a known-good password, so there is never a moment where the current credential is invalid while the app catches up.
  5. In finishSecret, after AWSCURRENT has moved — deleting it earlier kills the credential before promotion and causes an outage.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
