A mid-size SaaS company runs eleven customer-facing apps, an internal Moodle learning portal, and a fleet of admin tools, and every one of them bolted on its own login. After an outage where a single self-hosted Keycloak VM fell over during a database failover and locked 30,000 users out of all of them for forty minutes, the platform team got a clear mandate: stand up one identity provider that is genuinely highly available, survives a node loss without dropping sessions, and stops storing its own database on the same box it runs on. This guide walks through exactly that build — Keycloak on Kubernetes in HA, managed by the Keycloak Operator, backed by an external PostgreSQL you do not babysit, with Infinispan replicating session state across pods so a pod restart never logs anyone out. Everything here is real: real kubectl, real CRDs, real flags.
We target Keycloak 26.x (the operator and server share a version line) on any conformant Kubernetes 1.28+ cluster — EKS, AKS, GKE, or on-prem. The external database is a managed PostgreSQL 16 (Amazon RDS, Azure Database for PostgreSQL Flexible Server, or Cloud SQL); running Postgres outside the cluster is the whole point, so the IdP and its data fail independently.
Prerequisites
- A Kubernetes 1.28+ cluster with at least 3 worker nodes across distinct availability zones, and kubectl and helm v3 configured against it.
- A reachable managed PostgreSQL 16 instance with a dedicated keycloak database and login role, network path open from the cluster (security group / NSG / VPC peering), and TLS enabled.
- cert-manager installed (for the server’s TLS certificate) and an ingress controller or Gateway API implementation (NGINX, or the cloud’s ALB/Application Gateway).
- HashiCorp Vault reachable from the cluster, used here to issue the database credentials and the Keycloak admin bootstrap secret rather than hand-typing passwords into YAML.
- A DNS name you control for the IdP (we use id.example.com) and the ability to point it at the ingress.
- Cluster-admin to install CRDs.
Target topology
Three Keycloak pods run as a StatefulSet the operator manages, spread across three zones by a pod anti-affinity rule. They are stateless on disk — all durable state (realms, users, clients, offline sessions) lives in the external PostgreSQL, and all hot state (online user sessions, auth-code-to-token exchanges, login failures) lives in an Infinispan distributed cache that every pod shares over JGroups, discovering its peers through the Kubernetes API (KUBE_PING). A user whose request was being served by a pod that just died is re-served by another pod that already holds a replica of their session — no re-login. In front of the pods sits an ingress doing TLS passthrough or re-encryption to the IdP; Akamai sits at the public edge for global anycast, TLS termination, and WAF/bot mitigation so credential-stuffing floods are scrubbed before they reach the cluster. Keycloak itself is the broker: it federates upstream to Microsoft Entra ID and Okta as the workforce identity providers (so employees keep their corporate SSO), while issuing its own OIDC/SAML tokens downstream to the eleven apps and Moodle.
1. Install the Keycloak Operator and CRDs
The operator owns two CRDs — Keycloak (the server deployment) and KeycloakRealmImport (declarative realm bootstrap). Install them and the operator into a dedicated namespace.
kubectl create namespace keycloak
# CRDs and operator, pinned to the version line you intend to run
VERSION=26.0.5
kubectl apply -n keycloak -f \
https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
kubectl apply -n keycloak -f \
https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
kubectl apply -n keycloak -f \
https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/kubernetes.yml
kubectl rollout status deployment/keycloak-operator -n keycloak --timeout=120s
Confirm the CRDs registered:
kubectl get crd | grep keycloak.org
# keycloakrealmimports.k8s.keycloak.org
# keycloaks.k8s.keycloak.org
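In a scripted install, the grep above can be swapped for a blocking wait, since kubectl apply can return before the API server has finished registering a new CRD. A minimal sketch, assuming the two CRD names printed above; the function name wait_for_keycloak_crds and the 60-second timeout are illustrative choices, not something the operator prescribes:

```shell
# Block until both Keycloak CRDs report the Established condition.
# wait_for_keycloak_crds and the 60s timeout are hypothetical choices for this sketch.
wait_for_keycloak_crds() {
  kubectl wait --for=condition=Established --timeout=60s \
    crd/keycloaks.k8s.keycloak.org \
    crd/keycloakrealmimports.k8s.keycloak.org
}
```

Calling wait_for_keycloak_crds right after the three kubectl apply commands makes the pipeline fail fast if a CRD never registers, instead of failing later when the Keycloak CR is applied.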
2. Provision database and admin credentials from Vault
Do not put passwords in manifests. Have Vault mint them and land them as Kubernetes Secrets. If you run the Vault Secrets Operator or External Secrets, this is declarative; the imperative path below makes the contract explicit. The DB role is created on the managed PostgreSQL out of band (or via Vault’s database secrets engine for short-lived dynamic credentials).
# Pull a dynamic DB credential lease from Vault's database secrets engine
DB_CREDS=$(vault read -format=json database/creds/keycloak-role)
DB_USER=$(echo "$DB_CREDS" | jq -r .data.username)
DB_PASS=$(echo "$DB_CREDS" | jq -r .data.password)
kubectl create secret generic keycloak-db \
-n keycloak \
--from-literal=username="$DB_USER" \
--from-literal=password="$DB_PASS"
# Bootstrap admin (rotate/disable after first login; see Security notes)
ADMIN_PASS=$(vault kv get -field=password secret/keycloak/bootstrap-admin)
kubectl create secret generic keycloak-initial-admin \
-n keycloak \
--from-literal=username=tmpadmin \
--from-literal=password="$ADMIN_PASS"
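kubectl create secret fails with AlreadyExists when the Secret is already present, which matters once Vault hands out a fresh dynamic lease and the script is re-run. One common workaround is to render the Secret client-side and pipe it to kubectl apply. A sketch, assuming the same username/password keys as above; the helper name upsert_secret is hypothetical:

```shell
# upsert_secret NAMESPACE NAME USERNAME PASSWORD
# Renders the Secret locally (--dry-run=client) and applies it, so the same
# command both creates the Secret and updates it on later runs.
upsert_secret() {
  kubectl create secret generic "$2" -n "$1" \
    --from-literal=username="$3" \
    --from-literal=password="$4" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

With this helper, the database credential step becomes upsert_secret keycloak keycloak-db "$DB_USER" "$DB_PASS", and it can be repeated whenever a new lease is read.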
Stage the database TLS CA so Keycloak verifies the Postgres certificate rather than trusting blindly:
kubectl create configmap db-ca -n keycloak \
--from-file=root.crt=./rds-combined-ca-bundle.pem
3. Create the TLS certificate for the server
cert-manager issues the cert the pods present. Reference your real ClusterIssuer (ACME/Let’s Encrypt, or your internal CA).
# keycloak-tls.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: keycloak-tls
  namespace: keycloak
spec:
  secretName: keycloak-tls
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  dnsNames:
    - id.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
kubectl apply -f keycloak-tls.yaml
kubectl get certificate -n keycloak keycloak-tls -w # wait for READY=True
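cert-manager begins renewal once the certificate is within renewBefore of its expiry, so the two hour values in the Certificate translate directly into a renewal day. A small arithmetic sketch; the helper name renewal_day is made up for illustration:

```shell
# Day of the certificate's lifetime on which renewal begins:
# (duration - renewBefore), converted from hours to whole days.
renewal_day() {
  echo $(( ($1 - $2) / 24 ))
}

renewal_day 2160 360   # prints 75
```

With the values above, renewal starts on day 75 of a 90-day certificate, leaving 15 days to notice and fix a failing issuer.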
4. Deploy the HA Keycloak CR
This is the core object. It declares three replicas, points at the external database, enables the Infinispan cache stack for clustering, mounts the DB CA, and sets the hostname. The operator turns it into a StatefulSet, Services, and the cache configuration.
# keycloak.yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: keycloak
spec:
  instances: 3
  image: quay.io/keycloak/keycloak:26.0.5
  db:
    vendor: postgres
    host: pg-keycloak.abc123.ap-south-1.rds.amazonaws.com
    port: 5432
    database: keycloak
    usernameSecret:
      name: keycloak-db
      key: username
    passwordSecret:
      name: keycloak-db
      key: password
  hostname:
    hostname: https://id.example.com
  http:
    httpEnabled: false
    tlsSecret: keycloak-tls
  # Distributed session cache across pods -> no logout on pod loss
  cache:
    configMapFile:
      name: keycloak-cache
      key: cache-ispn-kubeping.xml
  resources:
    requests: { cpu: "1", memory: 1500Mi }
    limits: { cpu: "2", memory: 2Gi }
  additionalOptions:
    - name: db-driver
      value: org.postgresql.Driver
    - name: cache-stack
      value: kubernetes # JGroups KUBE_PING discovery
    - name: proxy-headers
      value: xforwarded # behind ingress/Akamai
    - name: hostname-strict
      value: "true"
  # Spread pods across zones; one extra layer beyond default anti-affinity
  unsupported:
    podTemplate:
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: keycloak
The custom Infinispan configuration below enables Kubernetes peer discovery and defines the distributed sessions and authenticationSessions caches; load it into the ConfigMap the CR references:
kubectl create configmap keycloak-cache -n keycloak \
--from-file=cache-ispn-kubeping.xml=./cache-ispn-kubeping.xml
<!-- cache-ispn-kubeping.xml (abridged: the distributed caches + KUBE_PING) -->
<infinispan>
  <jgroups>
    <stack name="kubernetes" extends="tcp">
      <TCP bind_port="7800"/>
      <org.jgroups.protocols.kubernetes.KUBE_PING
          namespace="keycloak"
          labels="app=keycloak"/>
    </stack>
  </jgroups>
  <cache-container name="keycloak">
    <transport stack="kubernetes"/>
    <distributed-cache name="sessions" owners="2"/>
    <distributed-cache name="authenticationSessions" owners="2"/>
    <distributed-cache name="offlineSessions" owners="2"/>
    <distributed-cache name="loginFailures" owners="2"/>
  </cache-container>
</infinispan>
owners="2" keeps two copies of every entry in these caches on different pods, so losing a single pod does not lose a session; only losing both owners of an entry at the same moment would. Apply the CR and watch the operator build the StatefulSet:
kubectl apply -f keycloak.yaml
kubectl get keycloak keycloak -n keycloak -o jsonpath='{.status.conditions}' | jq
kubectl rollout status statefulset/keycloak -n keycloak --timeout=300s
kubectl get pods -n keycloak -l app=keycloak -o wide # 3 pods, distinct zones/nodes
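The -o wide listing shows nodes, not zones. To confirm the spread numerically, each pod's node can be mapped to its topology.kubernetes.io/zone label and the distinct values counted. A sketch, assuming the nodes carry that standard label; pod_zones and count_distinct are hypothetical helper names:

```shell
# Print the zone label of the node each Keycloak pod runs on, one per line.
pod_zones() {
  for n in $(kubectl get pods -n keycloak -l app=keycloak \
               -o jsonpath='{.items[*].spec.nodeName}'); do
    kubectl get node "$n" \
      -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
  done
}

# Count distinct non-empty lines read from stdin.
count_distinct() {
  sort -u | grep -c .
}
```

Running pod_zones | count_distinct should print 3 for the topology described here; a lower number means at least two pods share a zone.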
5. Expose it through the ingress
Front the operator-managed Service (keycloak-service:8443) with your ingress. Use backend re-encryption so traffic stays TLS end to end; Akamai terminates the public TLS and forwards to this origin.
# keycloak-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak
  namespace: keycloak
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k" # large OIDC headers
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ id.example.com ]
      secretName: keycloak-tls
  rules:
    - host: id.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak-service
                port: { number: 8443 }
kubectl apply -f keycloak-ingress.yaml
6. Bootstrap a realm and broker upstream IdPs declaratively
Use KeycloakRealmImport so the realm, the downstream app clients, and the Entra ID / Okta identity-provider brokers are version-controlled, not hand-clicked. Keycloak federates up to corporate SSO and issues tokens down to the eleven apps and Moodle.
# realm-import.yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: KeycloakRealmImport
metadata:
  name: corp-realm-import
  namespace: keycloak
spec:
  keycloakCRName: keycloak
  realm:
    realm: corp
    enabled: true
    sslRequired: external
    identityProviders:
      - alias: entra-id
        providerId: oidc # Entra ID as upstream workforce IdP
        enabled: true
        config:
          clientId: "<entra-app-id>"
          clientSecret: "<from-vault>"
          authorizationUrl: "https://login.microsoftonline.com/<tenant>/oauth2/v2.0/authorize"
          tokenUrl: "https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token"
          defaultScope: "openid profile email"
      - alias: okta
        providerId: oidc # Okta as second workforce IdP
        enabled: true
        config:
          clientId: "<okta-client-id>"
          clientSecret: "<from-vault>"
          authorizationUrl: "https://example.okta.com/oauth2/v1/authorize"
          tokenUrl: "https://example.okta.com/oauth2/v1/token"
          defaultScope: "openid profile email"
    clients:
      - clientId: moodle
        protocol: openid-connect
        redirectUris: [ "https://learn.example.com/*" ]
        publicClient: false
kubectl apply -f realm-import.yaml
kubectl get keycloakrealmimport corp-realm-import -n keycloak \
-o jsonpath='{.status.conditions}' | jq
The clientSecret values shown should be sourced from Vault (via the Vault/External Secrets operator) rather than committed — the placeholders are deliberate.
7. Wire CI/CD, IaC, security, and observability
The cluster, the managed PostgreSQL, and the network are Terraform; node OS hardening and the Postgres parameter group are Ansible. The Keycloak CR, realm import, and ingress YAML deploy through Argo CD (GitOps — Argo watches the repo and reconciles the cluster to it), with the manifests built and validated in GitHub Actions (lint, kubeval, and a dry-run apply) before Argo syncs; on-prem clusters use Jenkins for the same gates. Wiz (with Wiz Code scanning the IaC in the pull request) runs CSPM over the cluster and the database to flag a publicly exposed Postgres or an over-broad security group; CrowdStrike Falcon sensors on the node pool give runtime threat detection feeding the SOC. Datadog (or Dynatrace) scrapes Keycloak’s /metrics endpoint and traces login latency, with ServiceNow receiving an auto-raised change request before a new realm or upstream IdP goes live and an incident ticket on any health-probe breach.
# Argo CD application pointing at the GitOps repo path
argocd app create keycloak \
--repo https://git.example.com/platform/keycloak-gitops.git \
--path manifests/keycloak --dest-namespace keycloak \
--dest-server https://kubernetes.default.svc --sync-policy automated
Validation
Prove HA before you trust it.
# 1. Readiness on every pod returns UP. The stock Keycloak image ships no curl,
#    so forward the management port (9000) and probe from the workstation.
probe() { # probe POD PATH: print the health JSON from one pod's management port
  kubectl port-forward -n keycloak "$1" 9000:9000 >/dev/null 2>&1 &
  pf=$!; sleep 2
  curl -sk "https://localhost:9000$2"
  kill "$pf"; wait "$pf" 2>/dev/null
}
for p in $(kubectl get pods -n keycloak -l app=keycloak -o name); do
  probe "$p" /health/ready | jq -r .status
done # expect "UP" x3
# 2. The cluster actually formed: JGroups view shows 3 members
kubectl logs -n keycloak keycloak-0 -c keycloak | grep -i "ISPN000094\|view"
# 3. Database connectivity is live (no embedded DB). The database check belongs to
#    readiness, not liveness, and is only reported when metrics are enabled.
probe pod/keycloak-0 /health/ready | jq .checks
# 4. THE session-survival test: log in, capture the cookie, kill the pod
#    serving you, then confirm the session still authenticates from another pod.
kubectl delete pod -n keycloak keycloak-0
# Re-hit https://id.example.com with the same session cookie -> still logged in.
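The cluster-view check can be made scriptable by pulling the member count out of the most recent view line rather than reading the grep output by eye. The sketch below assumes the Infinispan log line has the shape "ISPN000094: Received new cluster view for channel ISPN: [coordinator|id] (N) [members]"; if the server's log layout differs, the pattern needs adjusting. latest_view_size is a hypothetical helper name:

```shell
# Read Keycloak log lines on stdin and print the member count (the number in
# parentheses) from the last ISPN000094 cluster-view line. Assumed line shape:
#   ... ISPN000094: Received new cluster view for channel ISPN: [kc-0|2] (3) [kc-0, kc-1, kc-2]
latest_view_size() {
  sed -n 's/.*ISPN000094.*] (\([0-9][0-9]*\)) \[.*/\1/p' | tail -n 1
}
```

Piping kubectl logs -n keycloak keycloak-0 -c keycloak into latest_view_size and comparing the result with the instances value (3 here) turns the canary into a pass/fail gate.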
Confirm the public OIDC discovery document resolves through the ingress:
curl -s https://id.example.com/realms/corp/.well-known/openid-configuration | jq .issuer
# "https://id.example.com/realms/corp"
Rollback / teardown
The CR is the unit of rollback. To revert a bad config, kubectl apply the previous keycloak.yaml (Argo does this automatically if you revert the git commit). For a full teardown — note the external database and its data are untouched, which is exactly why we externalized it:
kubectl delete keycloakrealmimport corp-realm-import -n keycloak
kubectl delete keycloak keycloak -n keycloak # removes StatefulSet, Services
kubectl delete -f keycloak-ingress.yaml
kubectl delete configmap keycloak-cache db-ca -n keycloak
kubectl delete secret keycloak-db keycloak-initial-admin -n keycloak
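# Operator, then CRDs last (deleting a CRD deletes every CR of that kind):
kubectl delete -n keycloak -f \
  https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/26.0.5/kubernetes/kubernetes.yml
kubectl delete crd keycloaks.k8s.keycloak.org keycloakrealmimports.k8s.keycloak.org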
Because all durable state is in PostgreSQL, you can delete every pod and recreate the CR to land back exactly where you were — the realm, users, and clients are intact. Take a pg_dump before any major version bump regardless.
Common pitfalls
- hostname-strict mismatch. If the hostname in the CR does not match the URL users actually reach (because of the ingress rewriting Host or a missing proxy-headers: xforwarded), redirects break and OIDC clients reject the issuer. Set proxy-headers and match the public hostname exactly.
- Cache stack defaulting to local. Forget cache-stack: kubernetes (or the KUBE_PING config) and each pod runs an isolated local cache: logins randomly fail as requests bounce between pods that do not share auth-session state. The JGroups view log line in Validation is the canary.
- KUBE_PING RBAC. The pods’ ServiceAccount needs get/list on pods in the namespace for discovery; the operator sets this, but a restrictive PodSecurity or a custom SA will silently break clustering.
- Database connection-pool exhaustion. Three pods each open a pool; size the managed Postgres max_connections (and any PgBouncer in front) for instances × pool-size plus headroom, or pods fail readiness under load.
- DB CA not trusted. Without the mounted root.crt, enabling sslmode=verify-full fails the connection; with it disabled you lose TLS verification. Mount the CA (Step 2) and keep verification on.
- Single-zone scheduling. Without the topologySpreadConstraints, all three pods can land in one zone and a zonal outage takes the whole IdP down, defeating the exercise.
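The connection-pool pitfall comes down to one multiplication. A rough sizing sketch, assuming each pod's pool is capped by Keycloak's db-pool-max-size option; the figures (a cap of 100 per pod, 20 connections of headroom) and the helper name required_db_connections are illustrative, not measured values:

```shell
# Minimum max_connections the database should allow:
# instances * per-pod pool cap + headroom for admin and monitoring sessions.
# Arguments: INSTANCES POOL_MAX HEADROOM
required_db_connections() {
  echo $(( $1 * $2 + $3 ))
}

required_db_connections 3 100 20   # prints 320
```

Re-run the calculation whenever instances changes, since scaling the pods multiplies the connection demand on the database.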
Security notes
Disable the temporary bootstrap admin immediately after creating a real administrative account federated through Entra ID — leaving tmpadmin live is the classic Keycloak foothold. Keep httpEnabled: false and sslRequired: external so no plaintext listener exists. All secrets (DB credentials, upstream IdP client secrets, the bootstrap password) are issued by HashiCorp Vault and never live in git; the database credential is a short-lived dynamic lease that Vault rotates. Wiz Code gates the IaC pull request so a publicly exposed Postgres or an open security group is caught before merge, Wiz CSPM verifies it stays that way at runtime, and CrowdStrike Falcon watches the node pool for runtime compromise. Put Akamai’s WAF and bot mitigation in front to absorb the credential-stuffing traffic an internet-facing IdP attracts.
Cost notes
The dominant cost is the managed PostgreSQL, not the pods. Size it for the connection load (see the connection-pool pitfall above) and use a single multi-AZ instance rather than over-provisioning read replicas Keycloak does not use for its primary path. The three Keycloak pods are modest (1 vCPU / 1500Mi requested each); resist scaling pod count for availability you already get from Infinispan replication, and scale on actual login throughput, validated in Datadog. Externalizing the database means you pay for it independently, but it is the line item that buys the independent-failure property the whole project exists for. Reserve or commit-discount both the database and the node pool, and let ServiceNow change records tie each scale-up to an approved capacity request so growth stays accountable.