A payments company runs three environments that have to talk to each other and were never meant to: an EKS cluster in AWS that hosts the public API, an on-prem OpenShift cluster that holds the ledger because the auditors will not let it leave the data centre, and a fleet of virtual appliances on on-prem VMs (a Postgres HA pair and a message broker) that predate Kubernetes entirely. Today the API authenticates to the ledger with a shared bearer token that has been copy-pasted into four Kubernetes Secrets, a Jenkins credential, and (discovered during the last audit) a wiki page. The token has not been rotated in 18 months because nobody is sure what would break. The mandate from security is blunt: every service-to-service call gets mutual TLS, every identity is short-lived and attested, and there are no more long-lived shared secrets anywhere. This guide builds exactly that with SPIFFE and SPIRE: a SPIRE server per trust domain issues cryptographic SVIDs to workloads after attesting what they actually are, the two trust domains federate so a workload in one can verify a workload in the other, and the bearer token gets deleted for good.
SPIFFE (Secure Production Identity Framework For Everyone) is the spec — a SPIFFE ID like spiffe://payments.aws/ns/api/sa/checkout names a workload, and an SVID (SPIFFE Verifiable Identity Document, an X.509 cert or JWT) proves it. SPIRE is the production implementation: a server that mints SVIDs from a signing CA, and an agent on every node that attests the node and the workloads on it before handing them their identity over a local API. The result is workload identity that is issued just-in-time, expires in minutes, and is rooted in what a workload is (its Kubernetes service account, its AWS instance role, its binary path) rather than a secret it holds.
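To make the naming scheme concrete, here is a minimal Go sketch that splits a SPIFFE ID into its trust domain and workload path using only the standard library. The helper name parseSPIFFEID is hypothetical and the checks are deliberately loose; go-spiffe's spiffeid package validates more strictly and is the better choice in real code.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// parseSPIFFEID splits a SPIFFE ID into trust domain and path.
// Hypothetical helper for illustration only; it is not part of go-spiffe.
func parseSPIFFEID(id string) (trustDomain, path string, err error) {
	u, err := url.Parse(id)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "spiffe" {
		return "", "", errors.New("scheme must be spiffe")
	}
	if u.Host == "" || u.User != nil || u.Port() != "" {
		return "", "", errors.New("trust domain must be a bare host name")
	}
	if u.RawQuery != "" || u.Fragment != "" {
		return "", "", errors.New("query and fragment are not allowed")
	}
	return u.Host, u.Path, nil
}

func main() {
	td, path, err := parseSPIFFEID("spiffe://payments.aws/ns/api/sa/checkout")
	if err != nil {
		panic(err)
	}
	fmt.Println("trust domain:", td) // payments.aws
	fmt.Println("path:", path)       // /ns/api/sa/checkout
}
```

The trust domain is the part that selects which CA bundle a verifier uses; the path is what an authorization rule matches on.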
Prerequisites
- Two Kubernetes clusters (this guide uses EKS 1.29+ as trust domain payments.aws and OpenShift 4.14+ as payments.dc) with cluster-admin, plus one or more Linux VMs for non-Kubernetes workloads.
- kubectl 1.29+, helm 3.14+, the spire-server CLI, and oc for OpenShift, all on your workstation.
- A storage backend for SPIRE server state: this guide uses an RDS PostgreSQL instance for the AWS domain (the default in-memory datastore loses all registrations on restart, which is unacceptable in production).
- HashiCorp Vault reachable from both clusters (we use it as an upstream signing authority so SPIRE’s CA chains to your own root).
- Terraform 1.7+ for the AWS/RDS plumbing, and a CI runner (GitHub Actions or Jenkins) for registration-entry automation.
- DNS or load-balancer reachability between the two SPIRE servers’ bundle endpoints (federation needs an HTTPS path between domains).
Target topology
Two trust domains, each fully self-contained, then bridged. Trust domain payments.aws runs a SPIRE server on the EKS cluster, backed by RDS for its registration datastore and chaining its CA up to HashiCorp Vault’s PKI engine so every SVID it signs rolls up to your corporate root. A SPIRE agent runs as a DaemonSet on every EKS node and attests itself to the server using the AWS IID (Instance Identity Document), which is proof, signed by AWS, of which instance it is. Trust domain payments.dc runs the mirror image on OpenShift, with its own server, agent DaemonSet, and Vault-backed CA. The Postgres and broker VMs join payments.dc by running a SPIRE agent as a systemd service, attested by a one-time join token.
The bridge is SPIFFE federation: each server publishes its trust bundle (its set of CA roots) at an HTTPS bundle endpoint, and each is configured to fetch and trust the other’s bundle on a schedule. Once federated, the checkout service in payments.aws and the ledger service in payments.dc can complete a mutual-TLS handshake — each presents its X.509-SVID, each validates the peer against the correct trust domain’s bundle — with zero shared secrets and certificates that live for an hour. Around this core, Okta/Entra ID still governs human access to the SPIRE servers and dashboards (SPIFFE is for workloads, not people), Wiz / Wiz Code scans the manifests and runtime posture, CrowdStrike Falcon watches the nodes, Dynatrace/Datadog ingests SVID issuance and rotation metrics, Argo CD delivers the manifests, and ServiceNow carries the change approvals.
1. Provision the datastore and upstream CA with Terraform
SPIRE’s server is stateless compute but stateful data — registration entries, attested nodes, and the CA journal live in its datastore. Start with the AWS domain’s Postgres datastore and an IAM role the server will assume for node attestation. Keep this in your existing infra repo; do not invent a new secret store.
# spire-aws.tf
resource "aws_db_instance" "spire" {
  identifier                  = "spire-server-payments-aws"
  engine                      = "postgres"
  engine_version              = "16.3"
  instance_class              = "db.t3.medium"
  allocated_storage           = 50
  db_name                     = "spire"
  username                    = "spire"
  manage_master_user_password = true # password lives in Secrets Manager, never in tfstate
  multi_az                    = true # SPIRE server HA needs a durable datastore
  storage_encrypted           = true
  vpc_security_group_ids      = [aws_security_group.spire_db.id]
  db_subnet_group_name        = aws_db_subnet_group.private.name
}

# IRSA role the SPIRE server assumes to read EC2 instance metadata for node attestation
resource "aws_iam_role" "spire_server" {
  name               = "spire-server-payments-aws"
  assume_role_policy = data.aws_iam_policy_document.spire_irsa_assume.json
}

resource "aws_iam_role_policy" "spire_server_ec2" {
  role = aws_iam_role.spire_server.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["ec2:DescribeInstances"] # validates the IID the agent presents
      Resource = "*"
    }]
  })
}
terraform init && terraform apply -auto-approve
# capture the RDS endpoint and IRSA role ARN for the Helm values below
terraform output -raw spire_db_endpoint
terraform output -raw spire_server_role_arn
For the upstream CA, enable a PKI secrets engine in HashiCorp Vault and create a role SPIRE will use to get its intermediate signing certificate. This is what makes every SVID chain to your corporate root instead of a self-signed SPIRE CA — auditors and Wiz both want that chain.
vault secrets enable -path=pki_spire_aws pki
vault secrets tune -max-lease-ttl=87600h pki_spire_aws
# import or generate the intermediate that SPIRE's UpstreamAuthority will sign with
vault write pki_spire_aws/roles/spire-server \
allow_any_name=true enforce_hostnames=false \
key_type=ec key_bits=256 ttl=24h max_ttl=48h
# an AppRole SPIRE authenticates with (token mounted into the server pod via Vault Agent)
vault write auth/approle/role/spire-server \
token_policies="spire-pki" token_ttl=1h token_max_ttl=4h
2. Deploy the SPIRE server on EKS
Use the official spiffe/spire Helm chart. The values below pin the trust domain, point the datastore at RDS, enable the AWS IID node-attestation plugin, and wire the Vault UpstreamAuthority. The trust domain name is permanent — it is baked into every SPIFFE ID you ever issue — so choose it deliberately.
# values-spire-server-aws.yaml
global:
  spire:
    trustDomain: payments.aws
    clusterName: eks-payments
spire-server:
  ca:
    keyType: ec-p256
  dataStore:
    sql:
      databaseType: postgres
      connectionString: "dbname=spire user=spire host=<RDS_ENDPOINT> sslmode=require"
  nodeAttestor:
    aws_iid:
      enabled: true # nodes prove identity with the AWS-signed Instance Identity Document
  upstreamAuthority:
    vault:
      enabled: true
      address: "https://vault.payments.internal:8200"
      pkiMountPoint: "pki_spire_aws"
      caCertPath: "/run/secrets/vault-ca.pem"
      approleAuth:
        roleIDFilePath: "/run/secrets/role-id"
        secretIDFilePath: "/run/secrets/secret-id"
  controllerManager:
    enabled: true # the CRD-driven way to declare registration entries
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "<SPIRE_SERVER_ROLE_ARN>" # IRSA for ec2:DescribeInstances
helm repo add spiffe https://spiffe.github.io/helm-charts-hardened
helm repo update
helm upgrade --install spire spiffe/spire \
--namespace spire-server --create-namespace \
-f values-spire-server-aws.yaml --wait
# confirm the server is up and its CA chained to Vault
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server healthcheck
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server bundle show -format pem | openssl x509 -noout -issuer
# issuer should be your Vault intermediate, NOT a SPIRE self-signed CA
3. Deploy SPIRE agents and the Workload API
The agent runs as a DaemonSet so every node has one. It attests the node to the server (AWS IID again), then exposes the SPIFFE Workload API over a Unix domain socket that pods mount read-only. The chart deploys this when you enable spire-agent; the key piece is the CSI driver that injects the socket into workload pods without those pods needing any credential.
# values-spire-agent-aws.yaml
spire-agent:
  nodeAttestor:
    aws_iid:
      enabled: true
  workloadAttestors:
    k8s:
      enabled: true # binds an SVID to a pod's service account + namespace
    unix:
      enabled: true # for non-k8s processes on the node
spiffe-csi-driver:
  enabled: true # mounts the Workload API socket into pods read-only
helm upgrade --install spire spiffe/spire \
--namespace spire-server -f values-spire-server-aws.yaml \
-f values-spire-agent-aws.yaml --wait
# every node should show an attested agent
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server agent list
A workload consumes its identity by mounting the CSI volume — no Secret, no token file you have to rotate:
# in the checkout Deployment's pod spec
volumes:
  - name: spiffe-workload-api
    csi:
      driver: csi.spiffe.io
      readOnly: true
containers:
  - name: checkout
    volumeMounts:
      - name: spiffe-workload-api
        mountPath: /spiffe-workload-api
        readOnly: true
    env:
      - name: SPIFFE_ENDPOINT_SOCKET
        value: unix:///spiffe-workload-api/spire-agent.sock
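A workload can sanity-check the mount at startup before it tries to fetch an SVID. The sketch below is an assumption-laden illustration, not part of SPIRE: the helper socketPath is hypothetical, and it only confirms that SPIFFE_ENDPOINT_SOCKET is a unix:// address and that the path exists on disk.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// socketPath extracts the filesystem path from a SPIFFE_ENDPOINT_SOCKET value
// such as unix:///spiffe-workload-api/spire-agent.sock. Hypothetical helper.
func socketPath(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Scheme != "unix" || u.Path == "" {
		return "", fmt.Errorf("expected unix:///path, got %q", endpoint)
	}
	return u.Path, nil
}

func main() {
	endpoint := os.Getenv("SPIFFE_ENDPOINT_SOCKET")
	if endpoint == "" {
		// Fallback mirrors the value set in the pod spec above.
		endpoint = "unix:///spiffe-workload-api/spire-agent.sock"
	}
	path, err := socketPath(endpoint)
	if err != nil {
		fmt.Println("bad endpoint:", err)
		return
	}
	if _, err := os.Stat(path); err != nil {
		fmt.Println("socket not mounted yet:", err)
		return
	}
	fmt.Println("Workload API socket present at", path)
}
```

A missing socket at this point usually means the CSI volume was not mounted, which is cheaper to detect here than as a TLS failure later.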
4. Register workloads with the controller manager
A registration entry is the rule that says “a pod with service account checkout in namespace api, running on an attested node, gets SPIFFE ID spiffe://payments.aws/ns/api/sa/checkout.” With the SPIRE Controller Manager enabled in step 2, you declare these as ClusterSPIFFEID CRDs — which means Argo CD delivers them as ordinary manifests and they live in Git, reviewed like any other change. This is the part teams most want automated; never hand-create entries with the CLI in production.
# checkout-spiffeid.yaml (managed in Git, synced by Argo CD)
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: checkout
spec:
  spiffeIDTemplate: "spiffe://payments.aws/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: checkout
  federatesWith:
    - payments.dc # this SVID may be presented to the OpenShift trust domain
  dnsNameTemplates:
    - "checkout.api.svc.cluster.local"
  ttl: "1h" # short-lived by design; the agent rotates well before expiry
kubectl apply -f checkout-spiffeid.yaml
# verify the entry materialised
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server entry show -spiffeID spiffe://payments.aws/ns/api/sa/checkout
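To see what the spiffeIDTemplate expands to for a given pod, the following sketch renders the same template string with Go's text/template. It assumes Go-template semantics (which the {{ }} syntax suggests) and uses hypothetical stand-in structs for the two pod fields the template references; it is not the controller manager's own code.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Minimal stand-ins for the pod fields the template references (assumed shape).
type podMeta struct{ Namespace string }
type podSpec struct{ ServiceAccountName string }
type templateData struct {
	PodMeta podMeta
	PodSpec podSpec
}

// renderSPIFFEID expands a spiffeIDTemplate-style string for one pod.
func renderSPIFFEID(tmpl string, data templateData) (string, error) {
	t, err := template.New("spiffeid").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	const tmpl = "spiffe://payments.aws/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
	id, err := renderSPIFFEID(tmpl, templateData{
		PodMeta: podMeta{Namespace: "api"},
		PodSpec: podSpec{ServiceAccountName: "checkout"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // spiffe://payments.aws/ns/api/sa/checkout
}
```

Because the ID is derived from namespace and service account, an over-broad podSelector hands the same template to pods you did not intend, which is why the selector deserves review in Git.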
For workloads you cannot label-select (a CI job that needs to call an internal service, for instance), have your pipeline create the entry with the spire-server CLI, which calls the server API. Scripted in CI, this stays reviewable and repeatable, unlike an entry typed by hand. A small GitHub Actions step keeps the registration in lockstep with deploys:
# .github/workflows/register-svid.yml (excerpt)
# the aws_iid agent ID ends in the EC2 instance ID (i-...), not a security group ID
- name: Register batch-runner SVID
  run: |
    kubectl -n spire-server exec deploy/spire-server -- \
      /opt/spire/bin/spire-server entry create \
      -parentID spiffe://payments.aws/spire/agent/aws_iid/ACCOUNT/REGION/INSTANCE_ID \
      -spiffeID spiffe://payments.aws/ns/batch/sa/runner \
      -selector k8s:ns:batch -selector k8s:sa:runner -ttl 3600
5. Stand up the second trust domain on OpenShift
Repeat steps 2–4 on the OpenShift cluster with trust domain payments.dc, a separate Vault PKI mount (pki_spire_dc), and the OpenShift-appropriate node attestor. On bare OpenShift without a cloud IID, use the k8s_psat (Projected Service Account Token) node attestor, and grant the agent’s SCC the host-path access the CSI driver needs.
# values-spire-server-dc.yaml (the deltas that matter)
global:
  spire:
    trustDomain: payments.dc
    clusterName: ocp-payments
spire-server:
  nodeAttestor:
    k8s_psat:
      enabled: true # OpenShift has no AWS IID; attest via projected SA tokens
  upstreamAuthority:
    vault:
      enabled: true
      pkiMountPoint: "pki_spire_dc"
oc new-project spire-server
# the CSI driver and agent need hostPath + privileged socket access on OpenShift
oc adm policy add-scc-to-user privileged -z spire-agent -n spire-server
helm upgrade --install spire spiffe/spire -n spire-server \
-f values-spire-server-dc.yaml -f values-spire-agent-dc.yaml --wait
The virtual-appliance VMs (Postgres, broker) join payments.dc by running the agent as a systemd unit, attested with a one-time join token the server mints:
# on the SPIRE server (payments.dc), mint a single-use token for the DB VM
oc -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server token generate \
-spiffeID spiffe://payments.dc/node/pg-primary
# on the Postgres VM
sudo spire-agent run -config /etc/spire/agent.conf \
-joinToken <TOKEN_FROM_ABOVE>
# register the database process itself, selected by its Unix uid + path
oc -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server entry create \
-parentID spiffe://payments.dc/node/pg-primary \
-spiffeID spiffe://payments.dc/svc/ledger-db \
-selector unix:uid:999 -selector unix:path:/usr/lib/postgresql/16/bin/postgres \
-federatesWith payments.aws -ttl 3600
6. Federate the two trust domains
Now bridge them. Each server exposes a bundle endpoint (an HTTPS service publishing its CA roots), and each is told the other’s endpoint plus the SPIFFE ID that endpoint authenticates as. This is the single step that turns two isolated identity islands into a mesh — and it uses no shared secret: the bundle endpoint is authenticated by SPIFFE itself (https_spiffe profile).
Add federation config to both servers’ Helm values:
# add to values-spire-server-aws.yaml
spire-server:
  federation:
    enabled: true
    bundleEndpoint:
      port: 8443
    federatesWith:
      payments.dc:
        bundleEndpointURL: "https://spire-dc.payments.internal:8443"
        bundleEndpointProfile:
          type: https_spiffe
          endpointSPIFFEID: spiffe://payments.dc/spire/server
Bootstrapping is a chicken-and-egg problem: before either side has fetched the other’s bundle, you must seed each with the peer’s initial bundle once, by hand.
# export each domain's bundle, then import it into the other server
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server bundle show -format spiffe > aws.bundle
oc -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server bundle show -format spiffe > dc.bundle
# seed AWS with DC's bundle (and vice-versa)
kubectl -n spire-server exec -i deploy/spire-server -- \
/opt/spire/bin/spire-server bundle set -format spiffe -id spiffe://payments.dc < dc.bundle
oc -n spire-server exec -i deploy/spire-server -- \
/opt/spire/bin/spire-server bundle set -format spiffe -id spiffe://payments.aws < aws.bundle
After this one-time seed, each server refreshes the peer bundle automatically from the bundle endpoint, so CA rotations on either side propagate without intervention. Mark checkout (AWS) and ledger (DC) federatesWith each other as shown in steps 4–5, and the cross-domain handshake is now possible.
7. Enforce mTLS in the application
SVIDs are useless until something requires them. You have two paths. If you run a service mesh, Istio’s SPIRE integration makes Envoy fetch SVIDs from the Workload API and enforce STRICT mTLS with no app changes:
# PeerAuthentication: refuse any non-mTLS traffic
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: api
spec:
  mtls:
    mode: STRICT
If you do not run a mesh — likely for the VM-based ledger DB — use the SPIFFE helper libraries directly. The go-spiffe/v2 library fetches and auto-rotates the SVID and gives you a TLS config that validates the peer’s SPIFFE ID:
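// the checkout service dialling the ledger across trust domains
// imports: log, net/http, and go-spiffe/v2's spiffeid, spiffetls/tlsconfig, workloadapi
source, err := workloadapi.NewX509Source(ctx) // streams + rotates the SVID from the agent socket
if err != nil {
	log.Fatalf("unable to create X509Source: %v", err)
}
defer source.Close()
// only accept a peer whose SPIFFE ID is the ledger in the federated domain (must match its registered ID)
tlsConfig := tlsconfig.MTLSClientConfig(source, source,
	tlsconfig.AuthorizeID(spiffeid.RequireFromString("spiffe://payments.dc/svc/ledger")))
client := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}
resp, err := client.Get("https://ledger.payments.dc:8443/balances")
if err != nil {
	log.Fatalf("ledger call failed: %v", err)
}
defer resp.Body.Close()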
The server side calls tlsconfig.MTLSServerConfig with an AuthorizeID for spiffe://payments.aws/ns/api/sa/checkout. Now the connection only succeeds if both peers present valid, federated, non-expired SVIDs — and the shared bearer token has nothing left to do.
Validation
Prove the whole chain end to end, not just that pods are running.
# 1. A workload can actually fetch its SVID from the agent
kubectl -n api exec deploy/checkout -- \
/opt/spire/bin/spire-agent api fetch x509 \
-socketPath /spiffe-workload-api/spire-agent.sock
# expect an X509-SVID with SPIFFE ID spiffe://payments.aws/ns/api/sa/checkout and a ~1h expiry
# 2. The federated bundle is present on each side
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server bundle list
# expect to see spiffe://payments.dc listed (and the mirror on OpenShift)
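# 3. The cross-domain mTLS call returns 200 and the wire is encrypted
# note: curl presents no client certificate by default; against a strict-mTLS ledger, add --cert/--key/--cacert
# pointing at files written by `spire-agent api fetch x509 -write <dir>`, or run the check from the app's own client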
kubectl -n api exec deploy/checkout -- curl -so /dev/null -w '%{http_code}\n' \
https://ledger.payments.dc:8443/balances # -> 200
# 4. Negative test: a workload WITHOUT a registration entry is denied
kubectl -n api run rogue --image=curlimages/curl --restart=Never -- \
curl -s https://ledger.payments.dc:8443/balances # -> TLS handshake failure, by design
# 5. Watch rotation happen — the cert serial changes well before the 1h TTL
watch -n 30 "kubectl -n api exec deploy/checkout -- \
/opt/spire/bin/spire-agent api fetch x509 \
-socketPath /spiffe-workload-api/spire-agent.sock | grep -i 'serial\|after'"
Pipe SVID issuance, rotation, and Workload API latency to Dynatrace or Datadog — SPIRE exposes Prometheus metrics on the server and agents (spire_server_ca_manager_rotate, spire_agent_svid_rotation). Alert on a stalled rotation: a workload still serving a cert inside its last 15 minutes of validity means the agent or server is unhealthy, and you want that page before the SVID expires and traffic drops.
Rollback / teardown
Roll back in reverse dependency order; the application enforcement is the safe thing to relax first.
# 1. Drop mTLS enforcement back to PERMISSIVE so existing traffic survives the rollback
kubectl -n api patch peerauthentication default --type merge \
-p '{"spec":{"mtls":{"mode":"PERMISSIVE"}}}'
# 2. Remove federation so the domains stop trusting each other
kubectl -n spire-server exec deploy/spire-server -- \
/opt/spire/bin/spire-server bundle delete -id spiffe://payments.dc
# (mirror on OpenShift for spiffe://payments.aws)
# 3. Remove registration entries (Argo CD: delete the ClusterSPIFFEID manifests from Git instead)
kubectl delete -f checkout-spiffeid.yaml
# 4. Tear down SPIRE itself
helm uninstall spire -n spire-server
kubectl delete ns spire-server
# 5. On the VMs, stop and disable the agent
sudo systemctl disable --now spire-agent
The RDS datastore is managed by Terraform (terraform destroy -target=aws_db_instance.spire); the Vault PKI mounts from step 1 were created with the vault CLI and are removed with vault secrets disable. Destroy either only after confirming no other domain depends on the shared Vault root. Route any teardown of a production trust domain through a ServiceNow change request: pulling federation mid-flight breaks every cross-cluster call, so it needs an approved window.
Common pitfalls
- Choosing a throwaway trust-domain name. payments.aws is forever: it is embedded in every SVID and every federation relationship. Renaming means re-attesting every workload and re-seeding every bundle. Decide the naming scheme (we use <org>.<environment-or-location>) before the first helm install.
- Leaving the in-memory datastore. The chart’s default datastore is non-persistent; a server restart wipes every registration and every node has to re-attest. Always back the server with Postgres/RDS in production, as in step 1.
- Forgetting the federation bootstrap seed. Federation cannot fetch the peer bundle until it already trusts the peer’s bundle endpoint, which is a bootstrap paradox. You must do the one-time manual bundle set (step 6) or the endpoints will reject each other with a TLS error that looks like a network problem but is not.
- Skipping federatesWith on the registration entry. A workload only receives a peer domain’s trust bundle if its entry is explicitly marked federatesWith. Symptom: the SVID issues fine, but the cross-domain handshake fails because the client cannot validate the server’s chain. This is the single most common “it works same-cluster but not cross-cluster” bug.
- OpenShift SCC blocking the CSI socket. Without privileged (or a tailored SCC) the SPIFFE CSI driver cannot mount the host-path socket, and every workload silently fails to fetch an SVID. Grant it explicitly (step 5).
- TTL too long. A 24h SVID defeats the point. Keep ttl: 1h (or less) so a compromised workload’s credential is useless within the hour and rotation is exercised constantly rather than discovered broken during an incident.
Security notes
The whole exercise is Zero Trust by construction: identity is attested (proven from what a workload is, via AWS IID, k8s PSAT, or join token), credentials are short-lived X.509-SVIDs that rotate automatically, and there is no shared secret to leak — the bearer token that lived on a wiki page is deleted. Keep humans out of band: SPIFFE authenticates workloads, so admin access to the SPIRE servers and any dashboard still goes through Okta or Entra ID with conditional access and MFA, and SPIRE server RBAC is scoped to that SSO. Chain the CA to HashiCorp Vault (step 1) so every SVID rolls up to your audited corporate root rather than an opaque self-signed CA. Run Wiz / Wiz Code over the SPIRE manifests in the pipeline to catch an over-broad podSelector or a missing privileged justification, and over the running clusters for posture drift. Put CrowdStrike Falcon on the nodes — a host compromise is the one way to subvert node attestation, so runtime detection there is the backstop. Treat the bundle endpoints as security-critical surface: they publish your CA roots, so front them with proper TLS and network policy, and an over-permissive Akamai WAF/edge config in front of any externally reachable endpoint is a finding.
Cost notes
SPIRE itself is open source and the per-node footprint is small — the agent DaemonSet and CSI driver are a few hundred MB of memory per node, lost in the noise on most clusters. The real line items are the supporting infrastructure: a Multi-AZ RDS instance per trust domain (here db.t3.medium, a few thousand rupees a month each), the Vault cluster you are presumably already running, and the cross-AZ/cross-region data transfer the mTLS traffic incurs. Two practical levers: right-size the SPIRE server (it is not CPU-hungry — a single replica with a warm standby handles tens of thousands of registrations) and consolidate trust domains rather than minting one per cluster, since each domain is a server, a datastore, and a federation relationship to operate. The saving that justifies the project is not a line on the cloud bill anyway — it is deleting a class of long-lived secrets, the audit findings they generate, and the breach-blast-radius they carry. Capacity-plan the SPIRE server like any stateful service; track its issuance rate in Datadog/Dynatrace so you scale the datastore before SVID latency starts adding to every service-to-service call.