
Vault PKI as Enterprise Private CA for Service mTLS

A health-insurance carrier’s platform team is told, six weeks before an external HITRUST audit, that the assessor will sample east-west traffic between microservices and ask for proof that it is encrypted and mutually authenticated. The reality on the ground is the usual brownfield mess: 380 services across four Kubernetes clusters, TLS terminated at the ingress and then plaintext HTTP inside the mesh, a wiki page listing a handful of certificates that “someone” generated with openssl two years ago, two of which expired last month and took a payments callback down for ninety minutes before anyone connected the outage to a cert. The carrier handles PHI under HIPAA, the security team has a standing finding that “service-to-service identity is unmanaged,” and the audit clock is non-negotiable. The ask crystallizes into one sentence: every service must present a verifiable identity to every other service, certificates must rotate themselves, and no human should ever copy a private key again. This article is the reference architecture for building that — a HashiCorp Vault private CA issuing short-lived certificates that drive mesh mTLS, designed so a HITRUST assessor and a platform on-call engineer are both satisfied.

The pressures here are the ones that always sink hand-rolled PKI. Trust means a service must be able to prove who it is, not just assert a hostname, so a compromised pod cannot impersonate the claims-adjudication service. Lifecycle means certificates that expire on a schedule the platform controls and renew without a pager going off — the opposite of the expired-cert outage that started this. Scale means thousands of workload identities across clusters and environments, far past what a spreadsheet and openssl can track. And blast radius means that if any single signing key is exposed, the damage is bounded to one environment for a short window, not “reissue the entire estate.” A private certificate authority built on Vault’s PKI secrets engine satisfies all four, because it turns certificate issuance into an API call governed by policy, audited centrally, and scoped tightly — instead of a manual ritual nobody can reconstruct.

Why not the obvious shortcuts

Three alternatives will be proposed in the first design meeting, and each fails in a way worth naming.

A public CA (Let’s Encrypt / a commercial CA) for internal services is the wrong tool: public CAs only sign publicly-resolvable names, rate-limit aggressively, and have no concept of your internal SPIFFE-style service identities. You do not want claims-service.internal in a public Certificate Transparency log, and you cannot get a 24-hour internal cert for an unroutable name from them anyway.

A long-lived static CA with manually-minted certs (the current state) is exactly what is failing the audit: keys get copied into Git and laptops, expiry is a surprise, and revocation is theoretical because nobody knows where every copy lives.

Self-signed certs per service with no shared trust gives you encryption but not authentication: every service trusts everything, so mTLS becomes security theater that an assessor sees through immediately.

Vault PKI threads the needle. It is a CA as an API: a workload (or cert-manager on its behalf) authenticates with its own identity, calls an issuing endpoint constrained by a role, and receives a freshly-signed, short-lived leaf certificate plus the chain — with the private key generated at the edge and never transiting Vault if you choose. Issuance is policy-gated, every signature is in the audit log, certificate lifetime is measured in hours, and the signing keys for each environment are isolated so a breach is contained.

Architecture overview

[Figure: Vault PKI as Enterprise Private CA for Service mTLS, architecture diagram]

The design is a three-tier CA hierarchy feeding an automated issuance loop on Kubernetes. Holding the two halves apart in your head — the trust hierarchy (who signs whom) and the issuance flow (how a running pod gets a cert) — is the key to operating it.

The defining property of the whole topology is the one the assessor cares about most: the root CA is offline and air-gapped; nothing online can sign with it. The root signs exactly one kind of thing — environment intermediate CAs — and then goes back in the safe. Day-to-day issuance is done by online intermediates that can be revoked and re-issued from the root without ever rebuilding the trust store on every workload.

Trust hierarchy, top to bottom:

  1. An offline root CA with a 10-year lifetime, generated on an air-gapped host. Its private key never touches a network-connected machine. In Vault terms this is most safely done by generating the root outside Vault entirely (or in a dedicated Vault instance that is then sealed and powered off), exporting only the public certificate.
  2. One intermediate CA per environment (dev, staging, prod, and a separate prod-pci-style enclave for the payments path), each living in its own Vault PKI secrets-engine mount. Each intermediate is signed by the offline root once, with a 1–3 year lifetime, and is the thing that actually signs workload certs. Separate mounts mean separate keys, separate policies, separate audit trails, and independent revocation.
  3. Short-lived leaf certificates for every workload, signed by the relevant environment intermediate, with lifetimes of 24–72 hours and a SPIFFE-style identity URI in the SAN (spiffe://prod.health.internal/ns/claims/sa/adjudicator). Nobody renews these by hand; the issuance loop does, continuously.
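
The identity URI in item 3 follows a fixed shape, which is what lets a PKI role and a mesh policy reason about it. A minimal Python sketch of building and parsing that shape follows; the function names are hypothetical, and the /ns/<ns>/sa/<sa> layout is assumed from the example URI above rather than from any library.

```python
from urllib.parse import urlsplit


def workload_spiffe_id(trust_domain, namespace, service_account):
    """Build the SAN URI for a workload from its namespace and ServiceAccount."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


def parse_spiffe_id(uri):
    """Split a SPIFFE-style URI into trust domain, namespace, and ServiceAccount."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    well_formed = (
        parts.scheme == "spiffe"
        and bool(parts.netloc)
        and len(segments) == 4
        and segments[0] == "ns"
        and segments[2] == "sa"
        and all(segments)
    )
    if not well_formed:
        raise ValueError(f"not a /ns/<ns>/sa/<sa> SPIFFE ID: {uri!r}")
    return {
        "trust_domain": parts.netloc,
        "namespace": segments[1],
        "service_account": segments[3],
    }


uri = workload_spiffe_id("prod.health.internal", "claims", "adjudicator")
print(uri)
print(parse_spiffe_id(uri))
```

Because the namespace and ServiceAccount are recoverable from the URI, the same string can serve both as the certificate's name and as the key for authorization rules.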

Issuance flow, following a pod from birth to traffic:

  1. A new pod for the claims-adjudication service schedules on the prod cluster. Its Kubernetes ServiceAccount is its identity.
  2. cert-manager (running in-cluster) sees a Certificate resource for that workload and calls Vault’s PKI issuing endpoint through a Vault Issuer. It authenticates to Vault using the Kubernetes auth method — presenting the pod’s ServiceAccount token, which Vault validates against the cluster’s token reviewer — so the pod proves its identity to Vault before any cert is issued.
  3. Vault checks the bound policy and PKI role: this Vault role for prod only allows the claims namespace to request the claims SPIFFE identity, caps the TTL at 72 hours, and forbids wildcard or out-of-namespace SANs. If the request conforms, the prod intermediate signs a leaf; if not, it is denied and logged.
  4. cert-manager writes the signed cert, chain, and key into a Kubernetes Secret, and renews it at ~2/3 of its lifetime automatically — so a 72-hour cert is replaced roughly every 48 hours, long before expiry.
  5. The service mesh (Istio or Linkerd) mounts that Secret or, in the cleaner pattern, Vault is wired in as the mesh’s own CA so the mesh’s sidecar proxies (Envoy in Istio, linkerd2-proxy in Linkerd) obtain identities directly. Every pod-to-pod call now negotiates mTLS: both sides present a Vault-signed cert, both verify the other’s chain back to the environment intermediate and the offline root, and both check the peer’s SPIFFE identity against an authorization policy.
  6. The claims service calls the eligibility service; the sidecar proxy on each side does the mutual handshake transparently. Plaintext east-west traffic is gone, and both ends are authenticated, which is the sentence the HITRUST assessor needs.
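
The gate in step 3 can be pictured as a small pure function: match the requested identity against an allow-list, refuse wildcards, and cap the TTL. The Python sketch below is only an illustration of that logic under stated assumptions. The role dictionary and the function name are hypothetical, and the glob matching uses Python's fnmatch as a stand-in; it is not Vault's implementation or data model.

```python
from fnmatch import fnmatchcase

# Hypothetical, simplified stand-in for the prod "claims" role.
CLAIMS_ROLE = {
    "allowed_uri_sans": ["spiffe://prod.health.internal/ns/claims/*"],
    "max_ttl_hours": 72,
}


def evaluate_request(role, uri_sans, ttl_hours):
    """Return the TTL in hours that would be granted, or raise PermissionError."""
    if not uri_sans:
        raise PermissionError("no identity URI requested")
    for uri in uri_sans:
        if "*" in uri:
            raise PermissionError(f"wildcard SAN rejected: {uri}")
        if not any(fnmatchcase(uri, pattern) for pattern in role["allowed_uri_sans"]):
            raise PermissionError(f"SAN outside allow-list: {uri}")
    # A longer request is capped at the role ceiling rather than honored.
    return min(ttl_hours, role["max_ttl_hours"])


granted = evaluate_request(
    CLAIMS_ROLE,
    ["spiffe://prod.health.internal/ns/claims/sa/adjudicator"],
    ttl_hours=96,
)
print(granted)  # 72
```

A request from the claims namespace for a payments identity fails the allow-list check, which is the denial that would then show up in the audit log.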

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
| --- | --- | --- | --- |
| Root of trust | Offline root CA (air-gapped) | Signs environment intermediates only; never online | 10-yr lifetime; key generated off-network; powered off between signings |
| Issuing CAs | Vault PKI secrets engine (one mount per env) | Signs all workload leaf certs for its environment | 1–3 yr intermediate; separate mount/key/policy per env; tuned max_lease_ttl |
| Cert automation | cert-manager (in-cluster) | Requests, stores, and auto-renews leaf certs | Vault Issuer; renew at 2/3 TTL; one Certificate per workload identity |
| Workload auth | Vault Kubernetes auth method | Lets a pod prove identity via its ServiceAccount token | Token reviewer; role bound to namespace + SA; short token TTL |
| Service identity | SPIFFE SAN URIs | Cryptographic name for each service | spiffe://<env>.../ns/<ns>/sa/<sa>; enforced by PKI role allow-list |
| Data plane mTLS | Istio / Linkerd (Envoy sidecars) | Negotiates and enforces mutual TLS on every hop | STRICT mTLS mode; Vault as mesh CA or mounted leaf; SPIFFE authz policies |
| Human + automation auth | Microsoft Entra ID + Okta | SSO/OIDC for operators and CI into Vault | OIDC auth method to Vault; group → Vault policy mapping; MFA on root ops |
| Edge TLS | Akamai | Public-facing TLS termination, WAF, anycast at the perimeter | Public CA cert at edge; origin to mesh ingress; internal CA stays private |
| CI / GitOps | GitHub Actions + Argo CD + Terraform | Provisions Vault/PKI as code; deploys mesh + cert-manager config via GitOps | Vault provider; OIDC to cloud (no stored creds); Argo syncs mesh policy |
| Config mgmt | Ansible | Drives the offline-root ceremony and intermediate signing runbook | Idempotent playbook for the air-gapped host; signs CSR, exports cert |
| Posture / IaC scanning | Wiz + Wiz Code | Flags PKI/mesh misconfig and exposed keys; scans IaC pre-merge | Detects long-TTL roles, disabled mTLS, secrets in repos; attack-path view |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on nodes and the Vault hosts | Sensor on node pools + Vault VMs; detections to the SOC |
| Observability | Dynatrace / Datadog | Cert-expiry, issuance-rate, and handshake-failure telemetry | Vault PKI metrics; cert-expiry SLO alerts; mesh mTLS success-rate dashboards |
| ITSM / approvals | ServiceNow | Change gate for root/intermediate ops; auto-ticket on PKI alerts | Change record for any root signing; auto-incident on cert-expiry or auth spike |
| Workforce LMS | Moodle | Hosts the PKI runbook training the on-call rotation must complete | Tracks completion of the offline-root and revocation runbooks |

A few choices deserve the why, because they are where teams get burned.

Why an offline root and online intermediates, not one online CA. If the only CA is online and it is compromised, you must rebuild the trust anchor on every workload in the estate — a multi-day, multi-team fire drill. With an offline root, the root key is never exposed because it is never on a network. If an intermediate is compromised, you revoke just that intermediate, sign a fresh one from the offline root, and only that one environment re-issues — the rest of the estate, and the root, are untouched. The cost is the ceremony of bringing the root out to sign an intermediate, which happens rarely and is exactly the kind of high-ceremony, low-frequency event a runbook and a ServiceNow change record are built for.

Why short-lived leaf certs instead of revocation lists. Classic PKI leans on CRLs and OCSP to revoke long-lived certs, and both are operationally painful at mesh scale — CRLs go stale, OCSP adds a network dependency to every handshake. The cleaner answer is short lifetimes: if a leaf cert lives 48–72 hours and renews continuously, a compromised cert is useless within hours regardless of whether revocation propagated. You still keep CRLs for the intermediates (a short, rarely-changing list), but you mostly stop revoking leaves and let expiry do the work. This is the single design decision that makes mesh PKI tractable.

Implementation guidance

Provision Vault and the PKI engines with Terraform, but do the root by ceremony. The order matters: the trust hierarchy is built once, top-down, and getting the intermediate chain wrong means every workload fails verification with an opaque error.

  1. Stand up Vault (HA, auto-unseal via a cloud KMS), then enable a separate PKI secrets-engine mount per environment: pki_int_dev, pki_int_staging, pki_int_prod, pki_int_prod_pci.
  2. Generate the root offline, via an Ansible runbook on the air-gapped host, and export only its public certificate.
  3. For each environment, generate an intermediate CSR inside Vault (the private key never leaves that mount), carry the CSR to the offline root, sign it, and import the signed intermediate back into Vault. This is the one manual hop, gated by a ServiceNow change record.
  4. Define a tightly-scoped PKI role per environment that pins allowed SANs, identity URIs, and a hard max_ttl.
  5. Enable the Kubernetes auth method per cluster and bind policies that let each namespace request only its own identity.

A minimal Terraform shape for the prod intermediate mount and its role communicates the intent — short TTLs, namespace-scoped, no wildcards:

resource "vault_mount" "pki_int_prod" {
  path                      = "pki_int_prod"
  type                      = "pki"
  max_lease_ttl_seconds     = 7776000   # 90d ceiling for anything this CA signs
  description               = "Prod issuing CA — signs workload leaf certs only"
}

resource "vault_pki_secret_backend_role" "claims" {
  backend                     = vault_mount.pki_int_prod.path
  name                        = "claims"
  max_ttl                     = "72h"   # leaves expire fast; renewal does the work
  allow_subdomains            = true
  allowed_uri_sans            = ["spiffe://prod.health.internal/ns/claims/*"]
  allowed_domains             = ["claims.prod.svc.cluster.local"]
  allow_wildcard_certificates = false   # no wildcard leaves
  require_cn                  = false   # identity lives in the URI SAN; the Certificate below sets no commonName
  key_type                    = "ec"    # CSRs with any other key type are rejected
  key_bits                    = 256     # P-256
}

Wire cert-manager to Vault, and let it renew. cert-manager’s Vault Issuer authenticates with the Kubernetes auth method; from there each workload gets a Certificate resource and cert-manager owns the renewal clock. The critical field is renewBefore — set it so renewal happens at roughly two-thirds of the lifetime, giving a wide safety margin against the expiry that caused the original outage:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: claims-adjudicator
  namespace: claims
spec:
  secretName: claims-adjudicator-tls
  duration: 72h
  renewBefore: 24h                       # renew with a third of the lifetime to spare
  privateKey:
    algorithm: ECDSA                     # cert-manager defaults to RSA; the PKI role above only accepts EC keys
    size: 256
  uris:
    - spiffe://prod.health.internal/ns/claims/sa/adjudicator
  issuerRef:
    name: vault-prod
    kind: Issuer
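
The relationship between duration, renewBefore, and the "two-thirds" rule of thumb is plain arithmetic: renewal is attempted at expiry minus renewBefore. A short Python sketch, with an arbitrary example timestamp and a hypothetical helper name, makes the numbers for the manifest above concrete.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, duration, renew_before):
    """Moment at which renewal is attempted: expiry minus renewBefore."""
    if not timedelta(0) < renew_before < duration:
        raise ValueError("renewBefore must be positive and shorter than duration")
    return not_before + duration - renew_before


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example issuance time
duration = timedelta(hours=72)
renew_at = renewal_time(issued, duration, timedelta(hours=24))
fraction = (renew_at - issued) / duration

print(renew_at.isoformat())  # 2024-01-03T00:00:00+00:00
print(round(fraction, 3))    # 0.667
```

With duration 72h and renewBefore 24h, renewal lands 48 hours after issuance, which leaves 24 hours of validity in hand if a renewal attempt fails.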

Identity: federate the humans, bind the workloads. Operators never get a static Vault token. Human access to Vault is through the OIDC auth method federated to Microsoft Entra ID (with Okta as the upstream workforce IdP where the carrier standardizes on it), mapping IdP groups to Vault policies — so a platform engineer’s group grants read on PKI status but not the ability to sign from the root, and any root operation additionally requires MFA and a second approver. Workloads, in turn, authenticate only via the Kubernetes auth method bound to their ServiceAccount, so identity is the pod’s, not a shared secret. The result: no long-lived Vault token exists for a human or a service to leak.

Enterprise considerations

Security & Zero Trust. This architecture is the identity layer of Zero Trust east-west: every service call is mutually authenticated and encrypted, and authorization is keyed on cryptographic SPIFFE identity rather than network position, so a pod that lands on the cluster is not trusted merely for being inside it. Layer on top: (a) STRICT mTLS in the mesh so a service with no valid cert simply cannot talk, with no plaintext fallback to exploit; (b) PKI roles that allow-list SANs and identity URIs so a compromised namespace cannot mint a cert claiming to be the payments service; (c) Wiz and Wiz Code continuously scanning for the dangerous drifts (a PKI role with a year-long leaf TTL, an intermediate with no name constraints, mTLS quietly downgraded to PERMISSIVE, or a private key committed to a repo) and surfacing the attack path before an assessor or attacker does; (d) CrowdStrike Falcon sensors on the Vault hosts and node pools for runtime detection, because the Vault servers are now the carrier’s crown jewels; (e) a Vault audit-log spike in denied issuance or root-mount access auto-raises a ServiceNow incident so security gets a ticket, not just a log line. Vault’s audit device logs every issuance and auth event, which is the artifact that turns “we have mTLS” into “here is the signed proof of who issued what, when.”

Cost optimization. Unlike a per-cert commercial CA, an internal Vault PKI has near-zero marginal cost per certificate — you are paying for Vault’s compute and operations, not per-signature. The economics flip the usual instinct.

| Lever | Mechanism | Typical effect |
| --- | --- | --- |
| Self-hosted issuance | Vault PKI signs unlimited internal certs at no per-cert fee | Eliminates per-cert commercial CA spend entirely |
| Short TTLs over OCSP infra | Let expiry replace revocation for leaves | Drops the cost/complexity of running OCSP responders at scale |
| EC over RSA keys | ec/P-256 keys are cheaper to sign and verify | Lower CPU on every handshake across thousands of pods |
| Mesh-native CA wiring | Vault as the mesh CA, not per-pod Secrets plumbing | Less custom automation to build and maintain |
| Right-sized Vault HA | Three-node HA with KMS auto-unseal, not over-provisioned | Avoids paying for a fleet to do a control-plane job |

The real saving is in incidents avoided: the expired-cert outage that opened this article had a direct revenue and reputational cost that dwarfs the Vault footprint, and Dynatrace/Datadog cert-expiry SLOs make a recurrence structurally unlikely.

Scalability. Issuance scales horizontally with Vault’s HA cluster; the PKI engine signs thousands of certs per minute, well past mesh churn. cert-manager scales per cluster, so adding clusters adds issuance capacity rather than centralizing a bottleneck. The natural ceiling is Vault availability — if Vault is unreachable, new pods cannot get certs (existing pods keep their valid certs and traffic until renewal). Mitigate by running Vault HA across availability zones with KMS auto-unseal, and by choosing leaf TTLs long enough (48–72h) that a short Vault outage is invisible to running workloads — the buffer between renewal-time and expiry is your tolerance for control-plane downtime.
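
The "buffer between renewal-time and expiry" can be put in numbers. The Python sketch below assumes every renewal before the outage succeeded on schedule, so the worst-placed certificate is one that was about to renew when Vault went away; it then has exactly renewBefore left. The function names are hypothetical.

```python
from datetime import timedelta


def outage_tolerance(duration, renew_before):
    """Worst-case validity left on a running cert when Vault becomes unreachable.

    Assumes all earlier renewals succeeded, so the oldest cert in service is
    one that had just reached its renewal point.
    """
    if not timedelta(0) < renew_before < duration:
        raise ValueError("renew_before must be positive and shorter than duration")
    return renew_before


def survives(duration, renew_before, outage):
    """True if even the worst-placed cert is still valid when Vault returns."""
    return outage < outage_tolerance(duration, renew_before)


for hours in (24, 48, 72):
    duration = timedelta(hours=hours)
    renew_before = duration / 3  # renew at two-thirds of the lifetime
    tolerance = outage_tolerance(duration, renew_before)
    print(f"{hours}h leaf tolerates up to {tolerance} of Vault downtime")
```

Under the two-thirds rule, shortening the leaf TTL from 72h to 24h shrinks the tolerated outage from 24 hours to 8, which is the tradeoff behind preferring the longer end of the range.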

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Vault’s storage backend (the durable record of mounts, policies, and the intermediate keys) must be backed up and replicated — Vault Enterprise DR replication or regular snapshots to object storage. Because the offline root is the ultimate anchor and lives outside the running system, even total loss of the online Vault is recoverable: stand up fresh Vault, restore the snapshot, and if needed re-sign intermediates from the offline root. A pragmatic target: RTO 30 minutes to restore issuance from snapshot, RPO near-zero for PKI config via replication — and crucially, running workloads keep serving traffic on their existing valid certs throughout, so a control-plane outage is not a data-plane outage as long as it is shorter than the leaf TTL buffer. The offline root’s public cert and signing capability are themselves backed up to physical, off-network media.

Observability. Export Vault’s PKI and auth telemetry to Dynatrace (or Datadog) and build the dashboards the platform team actually lives by: certificate expiry (the leading indicator — alert on anything inside its renewal window that has not renewed), issuance rate and denial rate (a denial spike means a misbound policy or an attack), mTLS handshake success rate per service (the data-plane health signal), and Vault auth failures by auth method (a Kubernetes-auth failure cluster means a token-reviewer or binding problem). Tie a sustained cert-expiry or auth-failure breach to an auto-raised ServiceNow incident. The on-call rotation completes the PKI runbook training hosted in Moodle before joining, so the human who gets paged at 3 a.m. has actually walked the revocation and root-ceremony procedures.
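
The leading-indicator alert described above ("inside its renewal window that has not renewed") reduces to one comparison per certificate. A minimal Python sketch follows, assuming a hypothetical inventory of the certs currently being served; the LeafCert type, the grace period, and the function name are all illustrative and not part of any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LeafCert:
    name: str
    not_after: datetime       # expiry of the cert currently being served
    renew_before: timedelta   # renewal window configured for it


def overdue_for_renewal(certs, now, grace=timedelta(minutes=30)):
    """Names of certs that entered their renewal window more than `grace` ago.

    A cert that renewed successfully is replaced by one with a later
    not_after, so it no longer matches.
    """
    return [c.name for c in certs if now >= c.not_after - c.renew_before + grace]


now = datetime(2024, 1, 3, 1, 0, tzinfo=timezone.utc)  # arbitrary example time
inventory = [
    LeafCert("claims-adjudicator", datetime(2024, 1, 4, tzinfo=timezone.utc), timedelta(hours=24)),
    LeafCert("eligibility", datetime(2024, 1, 5, tzinfo=timezone.utc), timedelta(hours=24)),
]
print(overdue_for_renewal(inventory, now))  # ['claims-adjudicator']
```

Alerting here, rather than at expiry, gives the on-call engineer most of the renewal window to fix the issuance path before any workload loses its identity.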

Governance. Treat the PKI as code: PKI mounts, roles, and policies live in Terraform and change only through reviewed pull requests, with Wiz Code scanning the IaC pre-merge for a role that quietly widens TTL or drops a name constraint. Mesh authorization policies (which SPIFFE identity may call which) are GitOps-managed in Argo CD, reviewable and instantly revertable. Every root and intermediate operation is gated by a ServiceNow change record with a second approver. Vault’s audit device retains the full issuance and auth history as the compliance artifact, and a quarterly access review maps Entra/Okta groups to Vault policies so no operator quietly accrues root-signing rights.
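
The kind of pre-merge check described here can be sketched as a small lint over role definitions. The Python below is an illustration under stated assumptions, not how Wiz Code or any scanner works: roles are assumed to be already parsed into dictionaries keyed by the Terraform attribute names used earlier, and the duration parser is a simplified stand-in that handles only a single number with an s, m, h, or d suffix.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def ttl_seconds(value):
    """Parse a simple duration such as '72h' into seconds (simplified parser)."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def lint_role(name, role, max_leaf_ttl="72h"):
    """Return a list of findings for one PKI role definition."""
    findings = []
    max_ttl = role.get("max_ttl")
    if max_ttl is None:
        findings.append(f"{name}: no max_ttl set")
    elif ttl_seconds(max_ttl) > ttl_seconds(max_leaf_ttl):
        findings.append(f"{name}: max_ttl {max_ttl} exceeds {max_leaf_ttl}")
    # Treated as enabled unless the role explicitly turns it off.
    if role.get("allow_wildcard_certificates", True):
        findings.append(f"{name}: wildcard certificates not disabled")
    if not role.get("allowed_uri_sans"):
        findings.append(f"{name}: no identity URI allow-list")
    return findings


good = {
    "max_ttl": "72h",
    "allow_wildcard_certificates": False,
    "allowed_uri_sans": ["spiffe://prod.health.internal/ns/claims/*"],
}
drifted = {"max_ttl": "8760h"}  # a year-long leaf TTL, nothing else pinned

print(lint_role("claims", good))
for finding in lint_role("claims", drifted):
    print(finding)
```

Run in CI against the planned role changes, a non-empty findings list would fail the pull request before the widened role reaches Vault.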

Explicit tradeoffs

Accept these or do not build it. A private CA on Vault is real operational surface: you now run and protect Vault as tier-0 infrastructure, you own an offline-root ceremony with a runbook and physical custody, and you must keep cert-manager and the mesh healthy or new pods cannot get identities. Short-lived certs mean issuance happens constantly, so a Vault outage that outlasts your TTL buffer does eventually stop new workloads — the buffer is a tolerance window, not immunity. mTLS adds a small per-connection handshake cost and a real debugging-complexity cost: “it works without mTLS” is no longer a meaningful test, and a chain or SAN mismatch fails closed with an error that is opaque until you learn to read it. The offline-root model trades convenience for blast-radius control — re-signing an intermediate is a deliberate ceremony, by design, not a one-click action.

The alternatives, and when they win. If you are all-in on a single service mesh and do not need certs outside it, the mesh’s built-in CA (Istio’s Citadel, Linkerd’s identity) is simpler — reach for Vault when you want one CA spanning multiple meshes, VMs, and non-mesh workloads, or when central audit and policy across environments is the requirement. If you are on a single cloud and your workloads are all first-class cloud identities, a managed cloud CA (AWS Private CA, GCP CAS, Azure equivalents) offloads the operational burden — Vault wins on multi-cloud neutrality, on SPIFFE-native identity, and on keeping the CA under your own key custody rather than the cloud’s. And if you genuinely have a handful of long-lived services, a carefully-run static internal CA can suffice — but it is precisely the model that failed this carrier’s audit, and it does not scale to 380 services across four clusters without becoming the spreadsheet-and-openssl trap again.

The shape of the win

For the carrier, the payoff is not “we installed Vault.” It is that the HITRUST assessor samples traffic between the claims and eligibility services, asks for proof it is encrypted and mutually authenticated, and gets a Vault audit record showing exactly which short-lived, SPIFFE-named certificate each side presented, signed by the prod intermediate, chaining to an offline root that no online system can touch — and that the cert that served that request was minted four hours ago and will be gone by tomorrow. The finding that “service-to-service identity is unmanaged” closes. The expired-cert outage class is gone because nothing waits for a human to renew. And the next time an environment’s key is suspect, the answer is “revoke that intermediate, re-sign from the root, re-issue one environment” instead of a multi-day estate-wide rebuild. Everything upstream — the offline root, the per-environment intermediates, the Kubernetes-auth binding, the cert-manager renewal loop, the Wiz drift scanning, the Dynatrace expiry SLO — exists so that an auditor, a CISO, and a platform engineer on call each say yes. Start with one environment and one mesh if you must; this hierarchy is where managed, at-scale service identity has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
