HashiCorp Vault as Central Secrets Broker for Multi-Cloud Workloads

A mid-sized payments processor — call it the kind of fintech that clears card transactions for a few thousand merchants and is two audits away from a banking licence — has a secrets problem that finally became a board problem. Their workloads sprawl across all three clouds for honest reasons: the core ledger runs on AWS, the analytics estate grew up on GCP BigQuery, and a recent acquisition dragged in an Azure footprint nobody wants to migrate this year. Each cloud has its own secret store, each team has its own conventions, and the result is roughly 1,400 long-lived credentials scattered across .env files, CI variables, Kubernetes Secrets, and at least one wiki page. Then a routine review found an AWS access key, last rotated in 2022, sitting in a public-facing repository’s git history. Nothing was breached — this time — but the auditor’s finding was blunt: you cannot tell us who can access what, the credentials never expire, and you have no central place to revoke. The CISO’s mandate landed the next morning: one broker, every cloud, short-lived everything, full audit. This article is the reference architecture for building that with self-hosted HashiCorp Vault.

The pressures are the ones that always force this decision. Audit demands a single answer to “who accessed which secret, when, and under what identity” — impossible when secrets live in four places with four logging models. Blast radius demands that a leaked credential expire on its own, because a static key in a repo is a breach waiting for a date; a 15-minute dynamic credential is a non-event. Operational sprawl demands one workflow instead of three cloud-specific ones, so a platform team of eight can actually keep up with forty product teams. And least privilege demands credentials minted on demand, scoped to a job, and destroyed after — not handed out once and trusted forever. Vault is the pattern that satisfies all four: a central broker that generates credentials on the fly against each cloud and database, hands them out against a verified workload or human identity, and revokes them centrally — so a secret stops being a static artifact you store and starts being a short-lived lease you issue.

Why not the obvious shortcuts

Three cheaper answers will be proposed in the first planning meeting, and each fails in a way worth naming so the room can move past them.

“Just use each cloud’s native secret store.” AWS Secrets Manager, Azure Key Vault, and GCP Secret Manager are all good products — but using all three gives you three audit trails, three access models, three rotation mechanisms, and zero answer to a cross-cloud question. Worse, they mostly store static secrets you still have to rotate yourself; the dynamic-credential story is partial and cloud-locked. You would be standardizing on fragmentation.

“Rotate the static keys more often with a script.” Rotation scripts help, but a key that is valid for 24 hours is still a 24-hour window if it leaks, and rotation jobs fail silently, get disabled during incidents, and never cover everything. You are reducing the blast radius, not eliminating the class of problem.

“Bake secrets into the CI system’s variables.” This just moves the static secrets into Jenkins or GitHub Actions and makes the CI platform the highest-value target in the company. A compromised pipeline now holds production cloud credentials for every account.

Vault threads the needle by inverting the model. Instead of storing a credential and protecting it forever, Vault holds a parent credential (or a workload identity federation trust) and uses it to mint a fresh, scoped, time-boxed child credential every time a workload or human asks — verified against an identity Vault trusts, logged centrally, and revocable with one API call. The secret you would have leaked no longer exists as a durable thing.

Architecture overview

[Figure: architecture diagram of HashiCorp Vault as the central secrets broker for multi-cloud workloads]

The platform has one logical core and many spokes. The core is a Vault cluster running the integrated Raft storage backend — five nodes for production HA quorum — that owns no persistent secret material it cannot regenerate. The spokes are the consumers: human operators, CI pipelines, and workloads on EKS, AKS, and GKE, each authenticating with a different method but all arriving at the same broker.

The defining property of the topology is the one the auditor cares about: no consumer holds a long-lived cloud or database credential. Workloads present a verifiable identity, Vault checks it against a configured auth method, a policy decides what they may request, and Vault returns a credential with a lease — a TTL after which Vault revokes it whether or not the workload remembers to. Every step lands in the audit log.

Where to run the cluster. Place Vault in a dedicated, hardened account/subscription/project — for this fintech, a security-tooling AWS account in the primary region, peered to the other clouds over private connectivity (Transit Gateway / VNet peering / VPC peering through a partner interconnect, never the public internet). Vault is a Tier-0 system; it gets its own blast-radius boundary, its own tight network policy, and its own on-call. Expose it only on a private endpoint fronted by an internal load balancer.

Control flow, the human path:

  1. An SRE runs vault login against the internal endpoint. Vault’s OIDC auth method is wired to Okta as the workforce IdP, so the operator authenticates with their normal Okta credentials and adaptive MFA — no separate Vault password to manage or offboard.
  2. Okta returns an ID token; Vault validates it, reads the group claims, and maps the operator’s Okta groups to Vault policies (payments-prod-readonly, db-admin, and so on). The operator never gets a standing cloud credential; they get a short Vault token scoped to exactly what their group allows.
  3. When the operator needs, say, temporary read access to a production AWS account, they hit the AWS secrets engine and Vault calls sts:AssumeRole to return a 60-minute STS credential. It expires on its own.
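
The human path above could be wired through the Terraform Vault provider roughly as follows. This is a sketch under assumptions, not the fintech's actual configuration: the Okta org URL, the variable names, the redirect URIs, the "groups" claim and scope, and the TTL values are all hypothetical. Only the policy name payments-prod-readonly comes from the text, and the Okta group is assumed to carry the same name.

```hcl
# Hypothetical inputs; supply them from a secure variable source.
variable "okta_client_id" { type = string }
variable "okta_client_secret" {
  type      = string
  sensitive = true
}

# OIDC auth mount pointed at the workforce IdP (assumed Okta org URL).
resource "vault_jwt_auth_backend" "okta" {
  path               = "oidc"
  type               = "oidc"
  oidc_discovery_url = "https://example.okta.com"
  oidc_client_id     = var.okta_client_id
  oidc_client_secret = var.okta_client_secret
  default_role       = "operator"
}

# Login role: short operator tokens, group names read from the ID token.
resource "vault_jwt_auth_backend_role" "operator" {
  backend       = vault_jwt_auth_backend.okta.path
  role_name     = "operator"
  role_type     = "oidc"
  user_claim    = "email"
  groups_claim  = "groups" # assumes Okta is configured to emit this claim
  oidc_scopes   = ["openid", "email", "groups"]
  token_ttl     = 3600  # 1 hour
  token_max_ttl = 14400 # 4 hours
  allowed_redirect_uris = [
    "https://vault.internal:8200/ui/vault/auth/oidc/oidc/callback",
    "http://localhost:8250/oidc/callback",
  ]
}

# External identity group: Vault attaches the policy, Okta owns membership.
resource "vault_identity_group" "payments_prod_readonly" {
  name     = "payments-prod-readonly"
  type     = "external"
  policies = ["payments-prod-readonly"]
}

# The alias name must match the group value Okta puts in the groups claim.
resource "vault_identity_group_alias" "payments_prod_readonly" {
  name           = "payments-prod-readonly"
  mount_accessor = vault_jwt_auth_backend.okta.accessor
  canonical_id   = vault_identity_group.payments_prod_readonly.id
}
```

Keeping the policy on an external group rather than on the login role means access follows Okta membership, and the Vault side only changes when a new group-to-policy mapping is reviewed and merged.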

Control flow, the workload path (the high-volume one):

  1. A pod starts on EKS. It carries a projected Kubernetes ServiceAccount token. The Vault Agent injector (a mutating admission webhook installed via the Vault Helm chart) adds a Vault Agent init container and sidecar to the pod, and the Agent authenticates to Vault using the Kubernetes auth method, which validates that ServiceAccount token against the cluster’s API server through the TokenReview API.
  2. Vault’s policy for that ServiceAccount says it may read database creds. The Agent calls the database secrets engine, which connects to the workload’s PostgreSQL with an admin credential Vault holds, runs a CREATE ROLE ... VALID UNTIL statement, and returns a brand-new username/password valid for one hour.
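  3. The Agent writes that credential into a tmpfs file the app reads; it never lands in a Kubernetes Secret or an environment variable visible in a kubectl describe. The Agent renews the lease while the pod runs and, once the lease reaches its max TTL, fetches a fresh credential and re-renders the file, so the app must be able to re-read the file and reconnect. When the pod dies the renewals stop, the lease expires, and Vault drops the database user.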
  4. The identical pattern runs on AKS and GKE — each cluster registered as its own Kubernetes auth mount — so a workload on any cloud gets database and cloud credentials through one broker, one policy model, one audit log.
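
The workload path above might look like this for one cluster, again as a hedged Terraform sketch. The mount path kubernetes-eks, the ServiceAccount and namespace names, the policy name, and the two input variables are assumptions made for illustration; the credential path reuses the ledger-readonly database role named elsewhere in this article.

```hcl
# Hypothetical inputs describing the EKS control plane.
variable "eks_api_url" { type = string }
variable "eks_ca_pem"  { type = string }

# One Kubernetes auth mount per cluster; AKS and GKE get their own.
resource "vault_auth_backend" "eks" {
  type = "kubernetes"
  path = "kubernetes-eks"
}

resource "vault_kubernetes_auth_backend_config" "eks" {
  backend            = vault_auth_backend.eks.path
  kubernetes_host    = var.eks_api_url
  kubernetes_ca_cert = var.eks_ca_pem
}

# The only thing this workload may do: read one dynamic DB credential.
resource "vault_policy" "ledger_db_read" {
  name   = "ledger-db-read"
  policy = <<-EOT
    path "database/creds/ledger-readonly" {
      capabilities = ["read"]
    }
  EOT
}

# Bind the policy to exactly one ServiceAccount in one namespace.
resource "vault_kubernetes_auth_backend_role" "ledger_api" {
  backend                          = vault_auth_backend.eks.path
  role_name                        = "ledger-api"
  bound_service_account_names      = ["ledger-api"]
  bound_service_account_namespaces = ["payments"]
  token_policies                   = [vault_policy.ledger_db_read.name]
  token_ttl                        = 900 # 15 minutes; the Agent renews it
}
```

Because Vault runs outside the cluster in this design, it also needs permission to call TokenReview, either through a reviewer token set on the config resource or by granting the workload ServiceAccounts the system:auth-delegator role. Which of the two fits is a per-cluster decision this sketch leaves open.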

The unseal path (the part that makes it operable):

Vault encrypts its storage with an encryption key that is protected by a root key (historically called the master key), and the root key is itself stored encrypted; that outer layer is the “seal”. On every restart a sealed Vault is useless until unsealed. Manual unseal with Shamir key shares is a 3-AM ritual nobody wants, so this design uses auto-unseal via a cloud KMS: on startup Vault asks AWS KMS to decrypt the root key (Azure Key Vault or GCP Cloud KMS play the same role when the cluster lives in those clouds). This trades a manual ceremony for a dependency on KMS, which is the right trade for a system that must self-heal.

Component breakdown

| Concern | Vault mechanism | What it does here | Key configuration choices |
|---|---|---|---|
| Cluster storage / HA | Integrated Raft (5 nodes) | Self-contained HA quorum, no external storage to operate | retry_join across 3 AZs; odd node count for quorum |
| Auto-unseal | KMS seal | Decrypts the master key on restart so Vault self-heals | seal "awskms" with a dedicated CMK; IAM/identity scoped to one key |
| Human auth | OIDC → Okta | SSO + MFA for operators; group-to-policy mapping | Okta groups in the token; bound_claims to Vault policies |
| Workload auth | Kubernetes auth (×3) | Verifies EKS/AKS/GKE ServiceAccount tokens | One auth mount per cluster; bound to namespace + SA |
| CI auth | JWT/OIDC auth | GitHub Actions / Jenkins get short tokens, no stored creds | GitHub OIDC sub claims bound to repo/branch |
| AWS credentials | AWS secrets engine | Mints STS / IAM creds per request | assumed_role type; per-role policy; short max TTL |
| Azure credentials | Azure secrets engine | Mints service-principal creds per request | App registration; role assignment scope per workload |
| GCP credentials | GCP secrets engine | Mints OAuth tokens / SA keys per request | Prefer access tokens over SA keys; tight bindings |
| Database credentials | Database secrets engine | Generates per-pod DB users with VALID UNTIL | Per-role creation SQL; 1h default TTL; max TTL cap |
| TLS / mTLS | PKI secrets engine | Issues short-lived certs for service mesh mTLS | Intermediate CA per env; 24–72h cert TTL |
| Secret delivery | Vault Agent injector | Templates secrets to tmpfs, renews leases | Helm-installed; init + sidecar; tmpfs only |
| Posture / IaC scan | Wiz / Wiz Code | Flags Vault misconfig and any static secret in IaC | Agentless scan; Wiz Code on Terraform PRs |
| Runtime security | CrowdStrike Falcon | Runtime protection on Vault nodes and clusters | Sensor on node pools; detections to the SOC |
| Observability | Datadog / Dynatrace | Telemetry on lease counts, seal status, request latency | Vault metrics endpoint; alert on seal/leadership change |
| ITSM / change | ServiceNow | Change records for policy/engine changes; break-glass tickets | Change gate on policy merges; auto-ticket on seal event |
| Provisioning | Terraform / Ansible | Stand up cluster + configure mounts, roles, policies | Terraform Vault provider; Ansible for node config |

A few choices carry the weight of the design and deserve the why.

Why integrated Raft, not Consul. Older Vault deployments used a separate Consul cluster for storage, which meant operating two distributed systems and getting both quorums right. Integrated (Raft) storage folds storage into Vault itself: one cluster, one quorum, one thing to back up. For a team that wants Vault to be a tool and not a second career, this removes an entire operational surface. The cost is that Vault now owns its own storage durability, which is why snapshots (below) are non-negotiable.

Why dynamic secrets are the whole point. A static secret stored in any vault is still a static secret: its blast radius is “forever until someone rotates it.” A dynamic secret is generated per request with a lease, so the AWS credential a pod uses to read S3 exists for one hour and then stops working. How it ends depends on the credential type: STS credentials cannot be revoked early and simply lapse at their TTL, while for iam_user roles Vault deletes the IAM user it created when the lease ends. The mental shift is from storing and protecting secrets to issuing and expiring them. This is what turns the leaked-credential class of finding from “critical” into “the credential expired before the scanner finished indexing.”

Why auth method per identity type, not one shared token. It is tempting to issue one Vault token to a cluster and let everything share it. Don’t — that recreates the static-credential problem inside Vault. Each kind of caller authenticates with the method that can cryptographically prove its identity: humans via Okta OIDC, pods via the Kubernetes API’s token review, CI via GitHub’s OIDC. Identity is verified at the source every time, and policy is attached to that verified identity.

Implementation guidance

Provision with Terraform, configure with the Vault provider, and treat the seal and the quorum as the first two deliverables. Order matters: a cluster that cannot auto-unseal or cannot form quorum is not a cluster.

  1. A dedicated network boundary (security account/subscription/project) with private connectivity to the consumer clouds and an internal load balancer in front of the Vault nodes. No public ingress, ever.
  2. A dedicated KMS key for auto-unseal, with the Vault nodes’ instance identity granted kms:Encrypt, kms:Decrypt, and kms:DescribeKey on only that key. This key is itself Tier-0; losing it means losing the cluster.
  3. The five Vault nodes across three AZs (2-2-1), configured for Raft with retry_join, behind the internal LB. Initialize once and store the recovery keys offline, split among officers.
  4. The auth methods and secrets engines, configured as code through the Terraform Vault provider so policies, roles, and mounts are reviewable and reproducible — never click-configured in the UI.
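
Step 2 of the list above can be sketched with the Terraform AWS provider. The resource names and the vault_node_role_name variable are hypothetical; the three KMS actions are the ones the awskms seal needs, and the policy is scoped to the single key.

```hcl
# Hypothetical input: name of the IAM role attached to the Vault instances.
variable "vault_node_role_name" { type = string }

# Dedicated customer-managed key used only for Vault auto-unseal.
resource "aws_kms_key" "vault_unseal" {
  description             = "Vault auto-unseal key"
  enable_key_rotation     = true
  deletion_window_in_days = 30 # longest window, to guard against accidental deletion
}

# Least privilege: three actions, one key, nothing else.
data "aws_iam_policy_document" "vault_unseal" {
  statement {
    effect    = "Allow"
    actions   = ["kms:Encrypt", "kms:Decrypt", "kms:DescribeKey"]
    resources = [aws_kms_key.vault_unseal.arn]
  }
}

resource "aws_iam_role_policy" "vault_unseal" {
  name   = "vault-unseal"
  role   = var.vault_node_role_name
  policy = data.aws_iam_policy_document.vault_unseal.json
}
```

The 30-day deletion window is a deliberate choice for a key whose loss makes the cluster unrecoverable; it gives the team time to notice and cancel a mistaken deletion.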

A minimal HCL config for the node communicates the intent — Raft storage, KMS auto-unseal, TLS on the listener:

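# Integrated (Raft) storage: each node keeps its own copy of the data
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-prod-1" # must be unique per node

  # Peers to try when joining; list the other cluster members here
  retry_join { leader_api_addr = "https://vault-prod-2.internal:8200" }
  retry_join { leader_api_addr = "https://vault-prod-3.internal:8200" }
}

# Auto-unseal through the dedicated AWS KMS key
seal "awskms" {
  region     = "ap-south-1"
  kms_key_id = "arn:aws:kms:ap-south-1:111122223333:key/unseal-cmk"
}

# TLS-only API listener, reached through the internal load balancer
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"
  tls_key_file  = "/etc/vault/tls/vault.key"
}

# Addresses this node advertises to clients (api) and to peers (cluster)
api_addr     = "https://vault-prod-1.internal:8200"
cluster_addr = "https://vault-prod-1.internal:8201"

# Vault's documentation recommends disabling mlock with integrated storage;
# disable swap on the host instead
disable_mlock = true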

The cluster nodes themselves are configured with Ansible (package, config file, systemd unit, TLS material), while every Vault-internal object — mounts, roles, policies, auth bindings — is Terraform. Keep that line crisp: Ansible builds the boxes, Terraform configures the secrets platform. The Terraform that touches Vault policy passes through a ServiceNow change gate so a policy widening is a reviewed, recorded change, not a quiet merge.

Wire the database engine for per-pod users. This is the highest-volume engine for most enterprises. Vault holds one admin credential per database and generates ephemeral users on demand:

# Vault provider — a role that mints 1-hour Postgres users
resource "vault_database_secret_backend_role" "ledger_ro" {
  backend     = "database"
  name        = "ledger-readonly"
  db_name     = "ledger-postgres"
  default_ttl = 3600   # 1 hour
  max_ttl     = 14400  # hard cap 4 hours
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";"
  ]
}

A pod that needs read access now gets a unique, expiring database user — so a compromised pod’s database credential is useless an hour later, and the audit log shows exactly which pod held which generated username.
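
On the consuming side, the injector normally generates the Vault Agent configuration from pod annotations, but writing it out by hand shows what the Agent actually does with that role. The sketch below is an Agent config file, not Terraform, and its Vault address, mount path, role name, and output path are assumptions for illustration.

```hcl
# Vault Agent configuration (illustrative; the injector usually generates this).
vault {
  address = "https://vault.internal:8200"
}

# Log in with the pod's ServiceAccount token through the cluster's auth mount.
auto_auth {
  method "kubernetes" {
    mount_path = "auth/kubernetes-eks"
    config = {
      role = "ledger-api"
    }
  }
}

# Render the dynamic credential to a file; with the injector, /vault/secrets
# is a memory-backed volume shared with the application container.
template {
  destination = "/vault/secrets/db-creds"
  contents    = <<EOT
{{ with secret "database/creds/ledger-readonly" }}
username={{ .Data.username }}
password={{ .Data.password }}
{{ end }}
EOT
}
```

The Agent re-renders the file whenever it has to fetch a replacement credential, so the application should treat the file as the source of truth and reconnect when it changes rather than caching the password at startup.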

Kill the static cloud keys. For each cloud, mount its secrets engine and bind roles to least-privilege cloud roles. On AWS, prefer assumed_role so Vault returns STS credentials rather than minting standing IAM users. On GCP, prefer OAuth access tokens over service-account keys — an access token cannot be exfiltrated and reused for weeks the way a downloaded SA key can. On Azure, scope each service-principal role assignment to the narrowest resource group the workload needs. Set aggressive max_ttl on every role; the credential should outlive the job by minutes, not days.
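
As a hedged illustration of the AWS and GCP halves of that advice, the Terraform below mounts the AWS engine with an assumed_role role and defines a GCP roleset that returns access tokens. The role ARN, the project ID, the role and roleset names, and the TTLs are hypothetical, and the GCP engine is assumed to be mounted already at gcp.

```hcl
# AWS secrets engine; with no static keys set here, Vault falls back to the
# credentials of the instance it runs on.
resource "vault_aws_secret_backend" "aws" {
  path                      = "aws"
  region                    = "ap-south-1"
  default_lease_ttl_seconds = 900
  max_lease_ttl_seconds     = 3600
}

# assumed_role: Vault calls sts:AssumeRole and returns temporary credentials.
resource "vault_aws_secret_backend_role" "ledger_s3_read" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "ledger-s3-read"
  credential_type = "assumed_role"
  role_arns       = ["arn:aws:iam::111122223333:role/vault-ledger-s3-read"]
  default_sts_ttl = 900  # 15 minutes, the shortest duration STS allows
  max_sts_ttl     = 3600 # hard cap of 1 hour
}

# GCP roleset that hands out OAuth access tokens rather than SA keys.
resource "vault_gcp_secret_roleset" "bq_reader" {
  backend      = "gcp" # assumes the GCP engine is mounted at this path
  roleset      = "bq-reader"
  secret_type  = "access_token"
  project      = "analytics-project"
  token_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

  binding {
    resource = "//cloudresourcemanager.googleapis.com/projects/analytics-project"
    roles    = ["roles/bigquery.dataViewer"]
  }
}
```

The IAM role behind role_arns still carries the real permissions, so least privilege is enforced in two places: the cloud role limits what the credential can do, and the Vault policy limits who may ask for it.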

Federate the humans, federate the pipelines. Operators reach Vault through Okta OIDC with adaptive MFA, and their Okta group membership drives Vault policy, so there is no separate Vault offboarding step: a leaver removed from an Okta group no longer receives that group’s policies at the next login. Tokens issued before the removal stay valid until they expire, so keep operator token TTLs short and revoke outstanding tokens as part of offboarding. CI is the same idea with a machine identity: GitHub Actions authenticates via OIDC (its workflow sub claim bound to a specific repo and branch) and Jenkins via the JWT auth method, each receiving a short-lived Vault token scoped to exactly the secrets that pipeline needs. No cloud credentials are stored in CI at all, which closes the “CI is the crown-jewel target” gap directly.
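
The GitHub Actions binding might be expressed like this. The organisation, repository, policy name, and TTLs are placeholders; the issuer URL is GitHub's public OIDC endpoint, and the audience shown is the default GitHub uses (the repository owner's URL) when a workflow does not request a custom one.

```hcl
# JWT auth mount that trusts tokens signed by GitHub's OIDC issuer.
resource "vault_jwt_auth_backend" "github" {
  path               = "jwt-github"
  type               = "jwt"
  oidc_discovery_url = "https://token.actions.githubusercontent.com"
  bound_issuer       = "https://token.actions.githubusercontent.com"
}

# Only workflows on the main branch of one repository may log in.
resource "vault_jwt_auth_backend_role" "ledger_deploy" {
  backend         = vault_jwt_auth_backend.github.path
  role_name       = "ledger-deploy"
  role_type       = "jwt"
  user_claim      = "repository"
  bound_audiences = ["https://github.com/example-org"]
  bound_claims = {
    sub = "repo:example-org/ledger:ref:refs/heads/main"
  }
  token_policies = ["ledger-deploy"] # hypothetical policy name
  token_ttl      = 600               # 10 minutes
  token_max_ttl  = 900
}
```

Binding on the full sub claim rather than on the repository alone matters: without the branch, a pull request from any branch of the repo could obtain the same token.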

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction: no implicit trust, every request authenticated against a verified identity, every credential short-lived and least-privileged. Layer on top: (a) the PKI secrets engine issuing short-lived (24–72h) certificates for service-mesh mTLS, with an offline root and an intermediate CA per environment, so service-to-service traffic is mutually authenticated with certs that rotate faster than an attacker can use a stolen one; (b) Wiz running continuous posture scanning across the Vault infrastructure and Wiz Code scanning Terraform on every PR to catch a hardcoded secret or an over-broad policy before it merges, the backstop that would have caught the original leaked key at commit time; (c) CrowdStrike Falcon sensors on the Vault nodes and the consumer clusters for runtime threat detection, feeding the fintech’s SOC, because a Tier-0 system warrants runtime visibility, not just network controls; (d) tight audit-device configuration so every request, granted or denied, is logged to a tamper-evident sink, and a seal event or a leadership change auto-raises a ServiceNow incident. Critically, an audit file on a Vault node is only as trustworthy as the node itself, so the log must ship off-box to the SIEM; an attacker who reaches a Vault node must not be able to erase the record of what they did.
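
Items (a) and (d) translate into a small amount of Terraform. The mount path, role name, allowed domain, and log path below are hypothetical, and the CA certificates themselves (offline root, signed intermediate) are assumed to be set up separately.

```hcl
# PKI mount for the production intermediate CA; leaf certs capped at 72h.
resource "vault_mount" "pki_prod" {
  path                  = "pki-prod"
  type                  = "pki"
  max_lease_ttl_seconds = 259200 # 72h
}

# Role the service mesh uses to request short-lived leaf certificates.
resource "vault_pki_secret_backend_role" "mesh" {
  backend          = vault_mount.pki_prod.path
  name             = "mesh-service"
  allowed_domains  = ["svc.prod.internal"]
  allow_subdomains = true
  key_type         = "ec"
  key_bits         = 256
  ttl              = "86400"  # 24h default
  max_ttl          = "259200" # 72h cap
}

# File audit device; a log shipper on the node forwards it to the SIEM.
resource "vault_audit" "file" {
  type = "file"
  options = {
    file_path = "/var/log/vault/audit.log"
  }
}
```

Vault refuses to serve requests it cannot record on any enabled audit device, so many teams enable a second device alongside the file one to avoid a full disk becoming an outage. That addition is a suggestion, not part of the design described here.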

Cost optimization. Self-hosting Vault is a deliberate cost tradeoff against HashiCorp’s managed HCP Vault, and the math is worth doing honestly.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Self-host vs HCP | Run Raft on your own VMs vs pay for managed | Lower licence/SaaS spend, higher ops burden |
| Right-size the cluster | 5 small nodes serve enormous request volumes | Vault is rarely CPU-bound; don’t over-provision |
| Lease TTL tuning | Shorter TTLs = more churn; balance against API load | Too-short TTLs hammer Vault and the backends |
| Consolidate stores | Retire per-cloud secret stores Vault replaces | Removes duplicate tooling and its audit overhead |
| Open-source vs Enterprise | OSS covers core; Enterprise adds DR replication, namespaces, HSM | Pay only when you need the Enterprise features below |

The honest line for the CFO: self-hosted Vault trades a SaaS bill for the salary cost of a team that can operate a Tier-0 distributed system. For an eight-person platform team already running Kubernetes, that trade usually favors self-hosting; for a four-person team, HCP Vault is often the cheaper total cost once on-call and upgrade toil are counted.

Scalability. In the open-source edition the single active node serves every request and standby nodes simply forward to it, so capacity is whatever that node’s CPU, memory, and storage can sustain. Vault Enterprise adds performance standby nodes, non-leader nodes that serve read-only requests locally. Even there, the high-volume “give me a database credential” path does not fan out, because minting a dynamic credential creates a lease, and that write funnels to the single Raft leader. The natural ceiling is the leader’s write throughput and the lease count Vault tracks in memory, which is why TTL discipline matters: a million simultaneous one-second leases is its own denial-of-service. For very large or geographically split estates, Vault Enterprise performance replication stands up read-scaled secondary clusters in other regions/clouds. Each consumer cluster (EKS/AKS/GKE) is just another auth mount, so adding a fourth or fortieth cluster is configuration, not re-architecture.
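
A rough way to size the lease problem: if credentials are issued at a steady rate and never renewed, the number outstanding settles near issuance rate times TTL, so 200 requests per second at a one-hour TTL is about 200 × 3,600 = 720,000 live leases. One guard against a runaway client is a rate-limit quota on the mount. The sketch below assumes the database engine is mounted at database/, and the figure of 200 is purely illustrative.

```hcl
# Cap requests to the database secrets engine so one misbehaving consumer
# cannot flood the active node with credential requests.
resource "vault_quota_rate_limit" "database_creds" {
  name = "database-creds"
  path = "database/"
  rate = 200 # requests per second; an illustrative value, tune from telemetry
}
```

A quota is a backstop rather than a sizing tool. The real control is still TTL choice and making sure workloads reuse a credential for its lifetime instead of requesting a new one per connection.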

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers explicitly. Within a region, the 5-node Raft cluster across 3 AZs tolerates a node or AZ loss with seconds of leader-election RTO and zero RPO. Cross-region/cross-cloud DR is where the OSS-vs-Enterprise line bites: OSS DR means restoring a snapshot of the integrated Raft storage (taken on a schedule, shipped to object storage in another cloud) into a standby cluster, which gives a real RTO of tens of minutes and an RPO equal to your snapshot interval. The standby must also be able to reach the same auto-unseal KMS key, or it cannot decrypt what it restores. Vault Enterprise disaster-recovery replication keeps a warm standby cluster continuously synced for a much tighter RTO. A pragmatic target for this fintech on OSS: RTO 30 minutes, RPO 10 minutes, achieved with 10-minute automated snapshots to a second cloud and a rehearsed restore runbook. Whatever you choose, rehearse the restore: an untested snapshot is a hope, not a recovery plan.
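
On OSS the scheduled snapshot is typically a small job that runs vault operator raft snapshot save and uploads the result. That job should hold a token that can take snapshots and do nothing else; a minimal policy for it, with a hypothetical name, might be:

```hcl
# Least-privilege policy for the snapshot job: it may read the Raft
# snapshot endpoint and nothing else.
resource "vault_policy" "raft_snapshot" {
  name   = "raft-snapshot"
  policy = <<-EOT
    path "sys/storage/raft/snapshot" {
      capabilities = ["read"]
    }
  EOT
}
```

The snapshot file is encrypted with Vault's own keys, but it is still the whole secrets platform in one object, so the bucket it lands in deserves the same Tier-0 access controls as the cluster.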

Observability. Enable Vault’s telemetry and scrape it into Datadog (or Dynatrace), and alert on the signals that actually predict an outage: seal status (any node sealing), leadership changes (Raft instability), request latency and error rate per mount, and active token and lease counts (the explosion early-warning). Ship the audit log to the SIEM as the authoritative who-did-what record. Emit the business-meaningful metrics too — credentials issued per cloud, per-engine request rate, and lease TTL distribution — so the platform team can see, for instance, the equities team’s database-credential request rate spike before it becomes an incident.
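
Telemetry is switched on in the same server configuration file as storage and seal. The stanza below is a sketch that assumes a Datadog agent listening for DogStatsD on the local host; the address and retention value are illustrative.

```hcl
# Added to the Vault server config on each node.
telemetry {
  dogstatsd_addr            = "127.0.0.1:8125" # local Datadog agent (assumed)
  disable_hostname          = true             # keep metric names stable across nodes
  prometheus_retention_time = "30s"            # also expose metrics for scraping
}
```

To the best of my knowledge of Vault's telemetry reference, the metrics that map to the alerts above include vault.core.unsealed for seal status, vault.core.leadership_lost for leadership churn, and vault.expire.num_leases for lease growth. Confirm the names against the version you run before building monitors on them.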

Governance. Manage all Vault policy as code through the Terraform Vault provider, reviewed in pull requests and gated through ServiceNow change records, so a policy that widens access is a recorded, approved change. Use Vault Enterprise namespaces if you need hard multi-tenant isolation between business units sharing one cluster. Pin Vault versions and stagger upgrades: Raft makes rolling upgrades straightforward, but they must respect quorum, so upgrade one node at a time, standbys first and the active node last. Keep the audit log retained for the period your regulator demands. Every secrets engine, every role, and every policy should trace back to a reviewed commit, the inverse of the wiki-page-and-.env sprawl that started this.

Explicit tradeoffs

Accept these or do not build it. Centralizing on Vault makes it a Tier-0 single point of dependency: when Vault is down, nothing can get a fresh credential, so its availability must equal your most critical workload’s, and that is real operational weight — a 5-node distributed system to patch, snapshot, and keep in quorum. Dynamic secrets add moving parts the simpler world did not have: lease renewal logic in every workload, TTLs to tune (too long defeats the purpose, too short hammers the backends), and a database admin credential Vault must hold and protect. Auto-unseal trades a manual ceremony for a hard runtime dependency on cloud KMS — convenient until KMS is the thing that is down. And the multi-cloud private connectivity that keeps Vault traffic off the internet is its own networking project. None of this is free; all of it is cheaper than the breach the static-key sprawl was setting up.

The alternatives, and when they win. If you live entirely in one cloud and will for the foreseeable future, that cloud’s native secret store plus its workload identity federation (IRSA on AWS, Workload Identity on GKE, managed identities on Azure) gives you keyless workload access without operating Vault — and you should take it. If your team is small and cannot staff a Tier-0 distributed system, HCP Vault (HashiCorp’s managed offering) buys you the same capabilities without the upgrade-and-quorum toil, at a SaaS price. If your need is narrowly secret storage with simple rotation and you have no appetite for dynamic credentials, a managed secret store is enough and Vault is over-engineering. Vault earns its complexity precisely when you have multiple clouds, a real audit mandate, and a blast-radius problem with static credentials — which is exactly the corner this payments processor painted itself into.

The shape of the win

For the fintech, the payoff is not “we installed Vault.” It is that six months later the same auditor asks “show me who can access the production ledger database and prove the credentials expire,” and the platform team answers with one query against one audit log, points to the per-pod 1-hour database users and the Okta-group-to-policy mapping, and demonstrates that the leaked-key finding is now structurally impossible because the credentials a scanner could find no longer exist as durable artifacts. That is the sentence that closes the audit and clears the path to the banking licence. Everything upstream — the Raft quorum, the KMS auto-unseal, the Kubernetes auth per cluster, the dynamic cloud and database engines, the Okta federation, the Wiz scanning, the CrowdStrike sensors, the Datadog seal-status alerts — exists to let one small team tell one true story about every secret in three clouds. The architecture here is the destination; if you must start narrower, start with the database engine on the one cloud that hurts most, but this is where a multi-cloud, audited, blast-radius-bounded secrets program has to land.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
