A national health-insurance payer fails a HITRUST re-certification on a single finding: a service account password for the claims-adjudication database had not changed in 1,300 days, was pasted into four Jenkins jobs, and lived in plaintext in a Confluence runbook that 200 people could read. The auditor’s note is blunt — “no evidence of credential rotation; shared static secrets with no individual attribution.” The board gives the platform team two quarters and a number to hit: no human-known long-lived credential to any production datastore or cloud control plane, every rotation evidenced, and a break-glass path that is itself audited. The team has tried “rotate the passwords every 90 days” before; it died because rotation broke applications nobody could find, and the manual change tickets were a fiction written after the fact. This article is the reference architecture for doing it properly: a HashiCorp Vault dynamic-secret rotation program where credentials are generated on demand, leased, and revoked, every rotation writes a real ServiceNow change record, break-glass is a sealed and monitored exception, and Datadog watches lease health so the program is observable instead of hoped-for.
The pressures here are the ones that sink every naive rotation attempt. Compliance (HITRUST, SOC 2, PCI-DSS, HIPAA) demands evidence — not a policy document, but a per-credential audit trail showing who got what, when, and that it expired. Blast radius means a single leaked static secret currently grants standing access to PHI for an attacker who finds it in a log or a laptop. Operational reality means thousands of workloads across three clouds and a fleet of databases, many owned by teams that will not hand-edit a connection string on your schedule. And change governance means the company’s auditors expect every production change — including a credential rotation — to flow through the ServiceNow CMDB with an approval trail. Static secrets satisfy none of these. The shift is architectural: stop storing credentials and start minting them, so that the default lifetime of any secret is measured in minutes and revocation is a first-class operation rather than a password reset nobody dares to run.
Why static-secret rotation keeps failing
Three “fixes” get proposed on every one of these programs, and each fails predictably. Naming why matters, because someone will champion each of them.
A secrets manager holding rotated static passwords (the “vault as a password box” model) reduces sprawl but keeps the fundamental defect: the secret is still long-lived, still copied into application config at deploy time, still valid the instant it leaks, and rotation still requires every consumer to re-read it in lockstep — which is exactly the coordination problem that breaks production. You have moved the password, not retired it.
Scheduled bulk rotation — a cron job that resets every service password each quarter — fails on the long tail of unknown consumers. The job rotates the database password; an unmonitored batch script on a forgotten VM that still holds the old one starts failing at 2 a.m.; the on-call engineer’s fastest fix is to roll the password back, and now the rotation program has taught the organization that rotation causes outages. Nobody trusts it again.
“We’ll just use cloud-native IAM roles everywhere” is correct where it reaches — and it does not reach legacy databases, third-party SaaS APIs, on-prem virtual appliances, or the COTS application whose only auth is a static username and password. You end up with a clean story for new cloud-native workloads and an un-rotated swamp everywhere else, which is precisely where the audit findings live.
Dynamic secrets thread the needle. Vault holds a single privileged root credential per backend (which only Vault ever uses, and which Vault itself rotates), and on each request it creates a brand-new, uniquely named credential scoped to exactly what the caller needs, hands it back with a lease (a TTL), and revokes it — drops the database user, deletes the IAM access key — when the lease expires or is explicitly killed. The credential a workload uses today did not exist yesterday and will not work tomorrow. Leaked secrets become near-worthless because they are already expired; rotation stops being a coordinated event and becomes the steady-state behavior of the system; and revoking one compromised app’s access is a single API call that touches nothing else.
Architecture overview
The program runs three conceptually distinct flows that share the Vault core but live on different schedules, and keeping them separate is the first step to operating it sanely: a request-time dynamic-secret flow that serves workloads, a control-and-governance flow that emits change records and watches lease health, and a tightly sealed break-glass flow for the cases dynamic secrets cannot cover.
The defining property of the whole topology is the one the auditor cares about most: no human and no application ever holds a standing credential to a protected backend. The only long-lived secrets in the system are Vault’s per-backend root credentials, stored inside Vault’s encrypted barrier and rotated by Vault on a schedule, plus a small, sealed set of break-glass static credentials that are themselves under audit. Everything a workload uses is leased and ephemeral.
Request-time dynamic-secret flow, following the control path:
- A workload — a claims-processing service on Kubernetes, a Jenkins or GitHub Actions pipeline job, an Argo CD sync — needs database or cloud access. It authenticates to Vault not with a password but with a workload identity: the Kubernetes auth method validates the pod’s projected ServiceAccount token, or the JWT/OIDC auth method validates a GitHub Actions OIDC token, or cloud auth validates an instance identity. There is no bootstrap secret to leak — identity is proven by the platform.
- Vault maps that identity to a policy and a role. The role names the backend and the exact entitlement: “a Postgres user with `SELECT, INSERT` on the `claims` schema, TTL 1 hour,” or “an AWS IAM credential with this least-privilege policy, TTL 15 minutes.” Policy is attached to identity, not to a shared account.
- The relevant secrets engine mints the credential. The database secrets engine runs the configured creation SQL against the datastore using Vault’s root connection, producing a uniquely named user like `v-k8s-claims-x7Qp-1717...`. The AWS / Azure / GCP secrets engine calls the cloud IAM API to generate a short-lived access key or federated credential. Vault returns the credential plus a lease ID and TTL.
- The workload uses the credential. The Vault Agent sidecar (or CSI provider) handles auth, fetching, and template rendering, and, crucially, renews the lease before expiry for long-running workloads and re-fetches a fresh credential when renewal is no longer allowed. The application reads a credential from a file or environment, never from a checked-in config.
- When the lease expires, is explicitly revoked, or the pod dies, Vault runs the configured revocation SQL (or deletes the cloud key), and the credential ceases to exist at the backend. This is rotation: not editing a password, but the continuous birth and death of uniquely-attributed credentials.
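The issue, lease, and revoke cycle in the steps above can be sketched as a toy in-memory model. This is not Vault's implementation or API: `ToyEngine`, `issue`, `revoke`, and `tick` are hypothetical names, the clock is passed in by the caller, and the username format only imitates the `v-...` style mentioned above.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Lease:
    lease_id: str
    username: str
    expires_at: float  # seconds on the caller-supplied clock


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a secrets engine; not Vault's API."""
    backend_users: set = field(default_factory=set)  # users that exist at the "database"
    leases: dict = field(default_factory=dict)       # lease_id -> Lease

    def issue(self, identity: str, role: str, ttl: float, now: float) -> Lease:
        # Every request mints a brand-new, uniquely named credential.
        username = f"v-{identity}-{role}-{secrets.token_hex(4)}"
        lease = Lease(secrets.token_hex(8), username, now + ttl)
        self.backend_users.add(username)
        self.leases[lease.lease_id] = lease
        return lease

    def revoke(self, lease_id: str) -> None:
        # Revocation removes the user at the backend, not just the bookkeeping.
        lease = self.leases.pop(lease_id, None)
        if lease is not None:
            self.backend_users.discard(lease.username)

    def tick(self, now: float) -> int:
        # Revoke every lease whose TTL has elapsed; return how many were revoked.
        expired = [lid for lid, lease in self.leases.items() if lease.expires_at <= now]
        for lid in expired:
            self.revoke(lid)
        return len(expired)
```

Two requests from the same identity and role still produce two distinct users, and expiring one leaves the other untouched, which is the property that makes per-caller revocation possible.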
Control-and-governance flow, the part that turns rotation into evidence:
- Every credential-affecting event (a root-credential rotation, a dynamic-secret issue, a manual rotation of a static secret that cannot yet go dynamic, a break-glass checkout) is captured by Vault’s audit device, which logs every request and response with sensitive values HMAC-hashed. For the events the change process cares about, an integration that consumes that audit stream creates a ServiceNow change record. A scheduled root rotation opens a standard change (pre-approved, low-risk) that auto-closes on success; a break-glass checkout opens an emergency change that pages an approver. The CMDB now holds the documented, approvable trail the auditor demanded: each rotation is a ticket, not a story.
- Datadog ingests Vault’s telemetry and audit stream and renders the program’s health: count of leases by backend, lease TTL distribution, revocation success/failure, renewals nearing the max-TTL ceiling (the early warning that a workload is about to be cut off), token and lease counts trending toward limits, and seal status. A rotation program you cannot see is a rotation program that has already silently broken somewhere.
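The routing rule in the governance flow (standard change for routine rotations, emergency change for break-glass) might look like the sketch below. The event labels, the payload keys, and the choice to keep per-lease dynamic issuance in the audit log only are assumptions for illustration; Vault does not emit these labels, and a real ServiceNow payload would follow the instance's own change model.

```python
from typing import Optional

# Assumed event kinds, derived by whatever parses the audit stream.
STANDARD_EVENTS = {"root_rotation", "static_rotation"}
EMERGENCY_EVENTS = {"break_glass_checkout"}


def change_for_event(kind: str, backend: str, actor: str) -> Optional[dict]:
    """Return a change-request payload, or None when no change record is needed."""
    if kind in STANDARD_EVENTS:
        change_type, needs_approver = "standard", False
    elif kind in EMERGENCY_EVENTS:
        change_type, needs_approver = "emergency", True
    else:
        # High-volume events such as dynamic-secret issuance stay in the audit log.
        return None
    return {
        "type": change_type,
        "short_description": f"{kind} on {backend} by {actor}",
        "needs_approver": needs_approver,
        "auto_close_on_success": change_type == "standard",
    }
```

Keeping the mapping in one small function makes the governance decision reviewable: adding a new event kind to either set is a one-line, auditable change.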
Break-glass flow, sealed and exceptional: for the genuinely un-automatable cases — a vendor virtual appliance whose admin login cannot be scripted, a disaster scenario where Vault itself is unavailable — a small set of static credentials lives in a Vault KV path behind a separate, stricter policy requiring a quorum approval (Vault’s control-group or an external approval), every checkout writes a ServiceNow emergency change and a Datadog event, and the credential is mandatorily rotated immediately after use. Break-glass is not a convenience; it is an audited exception with an alarm attached.
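As a rough sketch of the quorum gate on that path, the function below approves a checkout only with MFA, enough distinct approvers, no self-approval, and approvers who hold a current training record. The quorum size of two and all names are hypothetical; in a real deployment the decision would be enforced by Vault control groups or the external approval system, not by application code like this.

```python
def checkout_allowed(requester: str, approvals: list, certified: set,
                     mfa_verified: bool, quorum: int = 2) -> bool:
    """Decide whether a break-glass checkout may proceed (illustrative policy only)."""
    if not mfa_verified:
        return False
    # Count each approver once, ignore self-approval, require current certification.
    valid = {a for a in approvals if a != requester and a in certified}
    return len(valid) >= quorum
```

Using a set for the valid approvers is the design point: the same person approving twice must never satisfy a two-person rule.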
Component breakdown
| Component | Service / tool | Role in the program | Key configuration choices |
|---|---|---|---|
| Secrets core | HashiCorp Vault (HA, integrated storage) | Mints, leases, renews, revokes all credentials; encrypted barrier; audit log | Raft storage; auto-unseal via cloud KMS; per-namespace tenancy |
| Human identity | Okta + Microsoft Entra ID | Operator SSO to Vault UI/CLI; MFA and conditional access on privileged paths | OIDC auth method; group → policy mapping; step-up MFA for break-glass |
| Workload identity | Kubernetes / JWT-OIDC / cloud auth | Secretless authentication of pods, pipelines, and instances | Bound ServiceAccounts; GitHub OIDC sub claims; instance identity docs |
| DB credentials | Vault database secrets engine | On-demand Postgres/MySQL/Oracle/MSSQL users, dropped on lease expiry | Creation/revocation SQL per role; 1h default TTL; root rotated by Vault |
| Cloud credentials | Vault AWS/Azure/GCP secrets engines | Short-lived IAM access keys / federated creds, least-privilege per role | 15–60 min TTL; STS/assumed-role where possible; one role per entitlement |
| Static & break-glass | Vault KV v2 + control groups | The un-automatable long tail and emergency access, under quorum approval | Versioned KV; mandatory rotate-after-use; quorum + ServiceNow gate |
| Agent / injection | Vault Agent + Secrets Store CSI | Auth, fetch, template, renew, re-fetch on the workload side | Sidecar auto-auth; file/env templates; cache + renewal loop |
| ITSM / change | ServiceNow | A change record per rotation; CMDB attribution; emergency-change on break-glass | Standard change for scheduled rotation; emergency for break-glass; auto-close on success |
| Observability | Datadog | Lease health, revocation success, TTL/seal dashboards, alerting | Vault integration + audit log pipeline; monitors on revocation failure & max-TTL approach |
| CI / CD consumers | Jenkins, GitHub Actions, Argo CD | Pipelines fetch ephemeral creds at run time instead of storing them | OIDC/AppRole auth to Vault; no secrets in pipeline config or env files |
| IaC / config | Terraform + Ansible | Declarative Vault config (engines, roles, policies) and host enrollment | vault Terraform provider for policy/roles; Ansible for agent rollout |
| CSPM / IaC scan | Wiz + Wiz Code | Detects exposed/standing secrets in cloud & repos; verifies no public Vault surface | Agentless scan for plaintext keys; Wiz Code in PRs; attack-path on secret exposure |
| Runtime security | CrowdStrike Falcon | Runtime protection on Vault nodes and workloads; detects credential abuse | Sensor on Vault cluster + node pools; detections to the SOC |
| Edge | Akamai | TLS, WAF, and access control in front of the Vault API for remote/partner callers | Private origin to internal Vault LB; WAF on the auth endpoints |
| LMS evidence | Moodle | Tracks operator training/attestation on break-glass and rotation procedures | Completion records linked to who may approve emergency change |
A few of these choices deserve the why, because they are the ones programs get wrong.
Why dynamic secrets beat “rotate the static one.” A rotated static secret is valid for its whole window and must be re-read by every consumer at the same instant — the coordination that breaks production. A dynamic secret is valid for an hour, is unique to one caller, and is revoked individually; there is no fleet-wide flag day, and a leaked credential is usually already dead. The cost is that the backend must support programmatic user creation (most databases and all major clouds do) and the workload must tolerate a credential that changes — which the Vault Agent’s renew/re-fetch loop handles transparently.
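The renew/re-fetch behavior can be reduced to a small decision function. This is a simplified sketch, not Vault Agent's actual algorithm: the `renew_fraction` of two thirds and the function name are assumptions, and all times are plain seconds supplied by the caller.

```python
def next_action(now: float, issued_at: float, expires_at: float,
                ttl: float, max_ttl: float, renew_fraction: float = 2 / 3) -> str:
    """Return 'wait', 'renew', or 'refetch' for one lease."""
    hard_deadline = issued_at + max_ttl  # no renewal can extend past this point
    renew_at = expires_at - ttl * (1 - renew_fraction)
    if now < renew_at:
        return "wait"
    # If the lease already runs to the max-TTL ceiling, renewing cannot help:
    # request a brand-new credential instead.
    if expires_at >= hard_deadline:
        return "refetch"
    return "renew"
```

With a 1-hour TTL and a 4-hour ceiling, a lease is renewed a few times and then replaced, so the application sees a credential change roughly every four hours rather than a sudden denial.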
Why a change record per rotation, not a quarterly attestation. Auditors do not accept “we rotate regularly”; they accept evidence that this credential rotated at this time with this approval. Wiring Vault’s lifecycle events to ServiceNow makes the CMDB the system of record: a standard (pre-approved) change for the routine high-volume rotations so you are not paging a human for every lease, and an emergency change — with an actual approver — for the rare break-glass event. The trail is generated by the system as the rotation happens, which is the difference between real evidence and a back-dated spreadsheet.
Why break-glass must exist and must hurt to use. Pretending every credential can be dynamic is how programs ship a back door nobody documented. Some appliances and DR scenarios need a static fallback. The discipline is to make that path expensive and loud: quorum approval, MFA step-up via Okta/Entra, an automatic emergency change, a Datadog event, and mandatory rotation immediately after. Break-glass that is easy is just the old static-secret problem wearing a costume.
Implementation guidance
Provision Vault and its config with Terraform; treat policy as code. Stand up a Vault HA cluster on integrated (Raft) storage with auto-unseal backed by a cloud KMS so a node restart does not require a human with unseal keys at 3 a.m. Use namespaces for tenancy so the claims platform, the pharmacy platform, and the corporate-IT teams get isolated mounts and policies under one cluster. Then define every secrets engine, role, and policy through the Terraform vault provider — never by hand — so the entitlement model is reviewable, diffable, and reproducible.
A database role makes the dynamic-secret idea concrete — Vault holds the root, and this is the user it mints and drops per lease:
```hcl
resource "vault_database_secret_backend_role" "claims_ro" {
  backend     = vault_mount.db.path
  name        = "claims-readonly"
  db_name     = "claims_pg"
  default_ttl = 3600  # 1 hour
  max_ttl     = 14400 # renew up to 4 hours, then re-fetch

  # Run over Vault's root connection each time a lease is issued.
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT USAGE ON SCHEMA claims TO \"{{name}}\";",
    "GRANT SELECT ON ALL TABLES IN SCHEMA claims TO \"{{name}}\";"
  ]

  # Run on lease expiry or revocation. Every grant is revoked first, because
  # Postgres refuses to drop a role that still holds privileges.
  revocation_statements = [
    "REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA claims FROM \"{{name}}\";",
    "REVOKE USAGE ON SCHEMA claims FROM \"{{name}}\";",
    "DROP ROLE IF EXISTS \"{{name}}\";"
  ]
}
```
Tell Vault to own the root connection and rotate it so no human ever knows the privileged password again — run this once at enrollment:
```shell
# -force is required because this write sends no data fields
vault write -force database/rotate-root/claims_pg
```
After that command, the database superuser password Vault used to connect is changed to a value only Vault holds inside its barrier. The original password in the runbook is now permanently invalid — which is the finding the auditor wanted closed, closed for good.
Authenticate workloads by identity, never by a bootstrap secret. On Kubernetes, enable the Kubernetes auth method and bind a ServiceAccount to a Vault role; the Vault Agent sidecar proves the pod’s identity with its projected token and renders credentials to a file the app reads. For CI, use GitHub Actions OIDC bound to a specific repo and branch (sub claim) so a pipeline gets exactly the entitlement that pipeline needs and nothing else — and Jenkins jobs use the JWT or AppRole method with a short-lived response-wrapped token. Argo CD fetches sync-time credentials the same way. The principle is uniform: the platform vouches for the workload, Vault issues a lease, and no static secret is ever planted to bootstrap the chain. Ansible handles the one-time host enrollment and agent rollout across the fleet.
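To see why binding on the `sub` claim narrows a pipeline's entitlement, consider this minimal claim-matching sketch. It assumes the token's signature and expiry have already been verified (Vault's JWT auth method does that itself), and the organization, repository, and audience values are hypothetical.

```python
def claims_match(token_claims: dict, bound_claims: dict) -> bool:
    """True only if every bound claim is present in the token with the exact value."""
    return all(token_claims.get(key) == value for key, value in bound_claims.items())


# Hypothetical binding for one repository's main branch.
BOUND = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "https://vault.example.internal",
    "sub": "repo:example-org/claims-service:ref:refs/heads/main",
}
```

A run from a feature branch of the same repository carries a different `sub`, so it fails the match and receives no lease, even though it comes from a legitimate pipeline.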
Wire the governance integration deliberately. Configure Vault’s audit device first — a rotation program with no audit log is non-compliant on day one — and ship that log to Datadog alongside Vault’s telemetry. Stand up the ServiceNow integration so that: scheduled root rotations and bulk static-secret rotations open and auto-close a standard change; a break-glass checkout opens an emergency change that must be approved by someone whose Moodle training record shows they are certified for it. Build the Datadog monitors that actually matter — revocation failures (a credential that should be dead is still alive), leases approaching max_ttl (workloads about to be cut off), seal status, and audit-log gaps.
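The audit-log gap monitor is the least obvious of those, so here is the underlying check as a sketch. In practice this would be a Datadog monitor on log volume; the function name and the silence threshold are assumptions chosen for illustration.

```python
def audit_gaps(timestamps: list, max_silence: float) -> list:
    """Return (start, end) pairs where consecutive audit entries are too far apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]
```

On a busy cluster the audit stream is effectively continuous, so a long silence more likely means a broken log pipeline than an idle Vault.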
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust about credentials by construction: no standing access, least-privilege per role, individually attributed and time-bounded secrets, and revocation as a one-call operation. Layer on top: (a) Okta/Entra OIDC for operator login with conditional access and step-up MFA on privileged and break-glass paths, so a stolen operator session cannot quietly drain the vault; (b) Wiz and Wiz Code scanning cloud accounts and source repos for the very thing this program eliminates (plaintext keys, long-lived IAM users, secrets committed to git), flagging any drift back toward standing credentials, and verifying Vault itself never gains a public data-plane surface; (c) CrowdStrike Falcon sensors on the Vault cluster nodes and consuming workloads for runtime threat detection and credential-abuse signals, feeding the SOC; (d) Akamai terminating TLS and enforcing WAF/access policy in front of the Vault API for any remote or partner caller, with a private origin to the internal Vault load balancer. A guardrail breach, such as a revocation failure or an unexpected break-glass checkout, auto-raises a ServiceNow incident so security gets a ticket, not just a log line.
Cost optimization. Vault’s licensing and the operational footprint dominate, and dynamic secrets add per-request work, so engineer for both.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Right-size TTLs | Longer default TTL + agent renewal for steady workloads; short TTL only where risk demands | Fewer credential-creation calls and backend user churn |
| Agent caching | Vault Agent caches and renews leases instead of re-fetching | Cuts request volume to the Vault core and the backends |
| Standard vs. emergency change | Pre-approved standard change for routine rotations; humans only on emergencies | Avoids per-rotation approval labor that kills adoption |
| Namespace consolidation | One HA cluster, many namespaces, vs. a cluster per team | Big saving on infra and operational overhead |
| Static-roles for the long tail | Vault static-role rotation where a true dynamic user is impractical | Gets rotation evidence without per-request minting cost |
| Self-service roles via IaC | Teams add roles through reviewed Terraform PRs | Removes the platform team as a per-request bottleneck |
Pipe Vault and backend-cost metrics to Datadog so the platform team can see whether an aggressively short TTL is buying real risk reduction or just generating database-user churn and cost.
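The churn tradeoff is simple arithmetic, sketched below for always-on workloads. The model is deliberately crude and the workload count in the usage note is hypothetical; it assumes each workload holds exactly one credential at a time and replaces it the moment the old one can no longer be extended.

```python
import math


def creations_per_day(workloads: int, ttl_s: int, max_ttl_s: int, renews: bool) -> int:
    """Backend user creations per day for steady, always-on workloads."""
    # With renewal a credential lives until max_ttl; without it, only one TTL.
    lifetime = max_ttl_s if renews else ttl_s
    return workloads * math.ceil(86400 / lifetime)
```

For 500 workloads with a 1-hour TTL and a 4-hour max TTL, renewal brings the daily count from 12,000 created users down to 3,000, which is the kind of number worth checking against what the database can absorb.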
Scalability. The Vault core scales with performance standby nodes, which serve read-only requests locally and forward anything that writes state to the active node. Issuing a dynamic credential creates a lease, so it counts as a write; standbys therefore offload policy and KV reads more than they offload credential minting. At very large scale, performance replication to regional clusters lets workloads authenticate against a local Vault. The real scaling pressure is the backends: minting a database user per request stresses the datastore, so use agent caching, sensible TTLs, and static roles (Vault rotates a fixed set of users on a schedule) where per-request creation would overwhelm a fragile legacy database. Cloud secrets engines hit IAM API rate limits, so prefer STS/assumed-role patterns that do not create durable principals. Size the audit-log and Datadog pipeline for the event volume; a busy cluster emits a lot of audit lines.
Failure modes, and what each one looks like. Name them before they page you.
- Vault is sealed or unreachable: no workload can fetch or renew a credential, and as leases expire applications lose access cluster-wide. Mitigation: HA with Raft and auto-unseal, performance standbys, agent caching with generous renewal windows, and the audited break-glass static path for true DR.
- Backend revocation fails: a credential that should be dead lives on (the database was unreachable when the lease expired), leaving an orphaned privileged user. Mitigation: Vault retries revocation and tracks failures; a Datadog monitor on revocation errors pages, and a periodic reconciliation drops orphaned `v-`-prefixed users.
- Max-TTL cutoff: a long-running job renews until it hits `max_ttl`, then is abruptly denied and fails mid-run. Mitigation: design jobs to re-fetch (not just renew), set `max_ttl` against real job duration, and alert when renewals approach the ceiling.
- Root-rotation lockout: `rotate-root` runs but the new password is not durably committed (a crash mid-rotation), and Vault can no longer connect to the backend. Mitigation: rotate roots in a controlled change window, verify connectivity immediately after, and keep a documented (sealed) recovery path for the root specifically.
- Break-glass abuse: the emergency path becomes a convenient bypass. Mitigation: quorum approval, MFA, mandatory post-use rotation, a Datadog event and ServiceNow emergency change every time, and periodic review of who used it and why.
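The periodic reconciliation for orphaned users comes down to a set difference. The sketch below assumes the job can list the usernames at the backend and the usernames behind live leases; how those two lists are obtained is left out, and the `v-` prefix follows the naming shown earlier in this article.

```python
def find_orphans(backend_users: set, leased_users: set, prefix: str = "v-") -> set:
    """Vault-style users present at the backend with no live lease behind them."""
    vault_created = {user for user in backend_users if user.startswith(prefix)}
    return vault_created - leased_users
```

A credential issued between the two listings would look orphaned, so a real job should re-check each candidate against current leases before dropping it, and should never touch users without the prefix.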
Reliability & DR (RTO/RPO). Decide the numbers per tier. Vault’s own state (Raft) replicates across the cluster for near-zero RPO; for regional loss, disaster-recovery replication to a secondary cluster gives a promotable standby. A pragmatic target: RTO 15 minutes, RPO near-zero for the Vault control plane, with the audited break-glass static credentials as the last-resort path to the most critical datastores if Vault is wholly unavailable during the incident. Crucially, test the break-glass procedure on a schedule — an emergency path nobody has exercised is a liability, not a safety net — and gate who may run it on a current Moodle training/attestation record.
Observability. Instrument the program so its health is a dashboard, not a faith statement. In Datadog, track active leases per backend, lease creation and revocation rates, revocation failure count (the metric that catches orphaned credentials), renewals approaching max-TTL, token/lease counts versus configured limits, seal status, and audit-log continuity. Emit the business metrics the auditor and the CISO want: percentage of production credentials that are dynamic vs. static (the program’s north-star number, trending toward zero static), count of un-rotated static secrets remaining, mean credential lifetime, and break-glass checkouts this quarter. Every break-glass event and every revocation failure raises a ServiceNow record automatically, so governance has a ticket trail and security has an alert.
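The business metrics can be computed from a credential inventory, as in this sketch. The inventory shape (`kind`, `last_rotated_day`) and the 90-day staleness threshold are assumptions for illustration; the source of the inventory, whether Vault mounts, Wiz findings, or a CMDB export, is left open.

```python
def program_metrics(inventory: list, now_day: float, max_static_age_days: float = 90) -> dict:
    """Summarize dynamic-vs-static coverage for a list of credential records."""
    total = len(inventory)
    dynamic = sum(1 for cred in inventory if cred["kind"] == "dynamic")
    # A static credential is stale when its last rotation is older than the threshold.
    stale = sum(
        1 for cred in inventory
        if cred["kind"] == "static"
        and now_day - cred["last_rotated_day"] > max_static_age_days
    )
    return {
        "dynamic_pct": 100.0 * dynamic / total if total else 0.0,
        "static_remaining": total - dynamic,
        "static_unrotated": stale,
    }
```

Emitting these three numbers as gauges gives the dashboard a trend line for the north-star metric rather than a point-in-time attestation.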
Governance. Keep all Vault policy, roles, and engine config in version control and promote it through reviewed Terraform PRs scanned by Wiz Code, so the entitlement model is auditable and instantly revertable. Pin and review creation/revocation SQL — a sloppy GRANT is a privilege-escalation bug. Use Vault namespaces and per-team policies so the principle of least privilege holds at the tenancy boundary, not just per role. Log every credential lifecycle event to the immutable audit device and retain it to the compliance regime’s horizon. And make the ServiceNow change record the system of record for rotation evidence, so a HITRUST or SOC 2 assessor pulls the trail from the CMDB rather than from a screenshot.
Explicit tradeoffs
Accept these or do not build it. Dynamic secrets add real moving parts: an agent or CSI provider on every workload, backends that must support programmatic user creation, an audit and observability pipeline you have to run, and a Vault cluster that is now a tier-0 dependency — if it is down, workloads cannot get credentials, which raises the operational stakes considerably. There is per-request cost and backend churn that short TTLs amplify, and a legacy database with a fragile connection limit may not tolerate per-request user creation at all, forcing you onto static roles for that slice. The Okta/Entra-to-Vault federation and the ServiceNow change integration are both setup and maintenance overhead that a ten-service shop can skip and a regulated enterprise absolutely cannot. And break-glass — the honest admission that not everything can be dynamic on day one — is itself a control surface you must monitor forever.
The alternatives, and when they win. If you are entirely cloud-native on one provider, native IAM roles and short-lived federated credentials (IAM Roles for Service Accounts, workload identity federation) cover most of this without Vault — and you should use them as Vault’s cloud backends rather than reinventing them. If your only goal is to stop committing secrets to git, a lighter secrets manager with native rotation (cloud Secrets Manager / Key Vault with rotation functions) is simpler, at the cost of keeping credentials long-lived. If the corpus of secrets is tiny and static, the full program is over-engineering — start with eliminating plaintext and enforcing MFA. Graduate to this Vault-and-ServiceNow program when you have many backends across multiple clouds plus a legacy long tail, and an auditor who wants per-rotation evidence and an attributed break-glass trail — which is exactly the situation that fails the certification in the first place.
The shape of the win
For the payer, the payoff is not “we bought Vault.” It is that the next HITRUST assessor asks for evidence of credential rotation on the claims database and the team pulls a ServiceNow change trail showing that the credential rotates continuously, that no human knows the database’s privileged password, that the one break-glass checkout last quarter was approved, MFA-gated, used, and rotated within the hour — and that a Datadog dashboard has shown the program green the whole time. The finding that failed the re-certification is not just remediated; it is structurally impossible to recur, because there is no longer a static password to leave un-rotated. Everything upstream — the dynamic secrets engines, the workload-identity auth, the Okta/Entra MFA, the Wiz scanning for drift, the ServiceNow change records, the Datadog lease-health monitors, the sealed break-glass path — exists to make a HITRUST assessor, a CISO, and an on-call engineer each trust the same thing: that credentials are born, used briefly, and die, on their own, every day. Start narrower if you must — kill plaintext, dynamic-ize one critical database, prove the change-record flow — but this is where an enterprise secrets program has to land.