Deploy Keycloak on Kubernetes in HA with the Operator and External PostgreSQL

A mid-size SaaS company runs eleven customer-facing apps, an internal Moodle learning portal, and a fleet of admin tools, and every one of them bolted on its own login. After an outage where a single self-hosted Keycloak VM fell over during a database failover and locked 30,000 users out of all of them for forty minutes, the platform team got a clear mandate: stand up one identity provider that is genuinely highly available, survives a node loss without dropping sessions, and stops storing its own database on the same box it runs on. This guide walks through exactly that build — Keycloak on Kubernetes in HA, managed by the Keycloak Operator, backed by an external PostgreSQL you do not babysit, with Infinispan replicating session state across pods so a pod restart never logs anyone out. Everything here is real: real kubectl, real CRDs, real flags.

We target Keycloak 26.x (the operator and server share a version line) on any conformant Kubernetes 1.28+ cluster — EKS, AKS, GKE, or on-prem. The external database is a managed PostgreSQL 16 (Amazon RDS, Azure Database for PostgreSQL Flexible Server, or Cloud SQL); running Postgres outside the cluster is the whole point, so the IdP and its data fail independently.

Prerequisites

Target topology

[Figure: target topology for Keycloak on Kubernetes in HA with the Operator and external PostgreSQL]

Three Keycloak pods run as a StatefulSet that the operator manages, spread across three zones by the operator's default pod anti-affinity plus a topology spread constraint. The pods are stateless on disk: all durable state (realms, users, clients, offline sessions) lives in the external PostgreSQL, and all hot state (online user sessions, auth-code-to-token exchanges, login failures) lives in an Infinispan distributed cache that every pod shares over JGroups, discovering its peers through the Kubernetes API (KUBE_PING). When a pod dies, a user whose request it was serving is picked up by another pod that already holds a replica of their session, so no re-login is needed.

In front of the pods sits an ingress doing TLS passthrough or re-encryption to the IdP. Akamai sits at the public edge for global anycast, TLS termination, and WAF/bot mitigation, so credential-stuffing floods are scrubbed before they reach the cluster. Keycloak itself is the broker: it federates upstream to Microsoft Entra ID and Okta as the workforce identity providers (so employees keep their corporate SSO) and issues its own OIDC/SAML tokens downstream to the eleven apps and Moodle.

1. Install the Keycloak Operator and CRDs

The operator owns two CRDs — Keycloak (the server deployment) and KeycloakRealmImport (declarative realm bootstrap). Install them and the operator into a dedicated namespace.

kubectl create namespace keycloak

# CRDs and operator, pinned to the version line you intend to run
VERSION=26.0.5
kubectl apply -n keycloak -f \
  https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
kubectl apply -n keycloak -f \
  https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
kubectl apply -n keycloak -f \
  https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${VERSION}/kubernetes/kubernetes.yml

kubectl rollout status deployment/keycloak-operator -n keycloak --timeout=120s

Confirm the CRDs registered:

kubectl get crd | grep keycloak.org
# keycloakrealmimports.k8s.keycloak.org
# keycloaks.k8s.keycloak.org
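
Listing the CRDs shows they exist, but not that the API server has finished accepting them. A minimal sketch of a stricter check, assuming you want a script to block until both CRDs report the Established condition; wait_for_keycloak_crds is an illustrative helper name, not something the operator provides:

```shell
# wait_for_keycloak_crds: block until both Keycloak CRDs are Established.
# Illustrative helper; the CRD names are the two registered in this step.
wait_for_keycloak_crds() {
  for crd in keycloaks.k8s.keycloak.org keycloakrealmimports.k8s.keycloak.org; do
    # kubectl wait exits non-zero on timeout, which stops the loop early.
    kubectl wait --for=condition=Established --timeout=60s "crd/${crd}" || return 1
  done
}
```

Call it as `wait_for_keycloak_crds && echo "CRDs ready"` before applying any Keycloak or KeycloakRealmImport resource.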

2. Provision database and admin credentials from Vault

Do not put passwords in manifests. Have Vault mint them and land them as Kubernetes Secrets. If you run the Vault Secrets Operator or External Secrets, this is declarative; the imperative path below makes the contract explicit. The DB role is created on the managed PostgreSQL out of band (or via Vault’s database secrets engine for short-lived dynamic credentials).

# Pull a dynamic DB credential lease from Vault's database secrets engine
DB_CREDS=$(vault read -format=json database/creds/keycloak-role)
DB_USER=$(echo "$DB_CREDS" | jq -r .data.username)
DB_PASS=$(echo "$DB_CREDS" | jq -r .data.password)

kubectl create secret generic keycloak-db \
  -n keycloak \
  --from-literal=username="$DB_USER" \
  --from-literal=password="$DB_PASS"

# Bootstrap admin (rotate/disable after first login; see Security notes)
ADMIN_PASS=$(vault kv get -field=password secret/keycloak/bootstrap-admin)
kubectl create secret generic keycloak-initial-admin \
  -n keycloak \
  --from-literal=username=tmpadmin \
  --from-literal=password="$ADMIN_PASS"
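
One wrinkle with the imperative path: kubectl create secret fails if the Secret already exists, and a short-lived Vault lease means you will rerun this. A hedged sketch of an idempotent variant that renders the Secret client-side and pipes it to kubectl apply; upsert_secret is a hypothetical helper name and the argument order is my own choice:

```shell
# upsert_secret NAME NAMESPACE USERNAME PASSWORD
# Renders the Secret locally (nothing is sent to the cluster by "create"),
# then applies it, so a rerun updates the Secret instead of failing.
upsert_secret() {
  kubectl create secret generic "$1" \
    -n "$2" \
    --from-literal=username="$3" \
    --from-literal=password="$4" \
    --dry-run=client -o yaml |
    kubectl apply -n "$2" -f -
}
```

With the variables from the block above, `upsert_secret keycloak-db keycloak "$DB_USER" "$DB_PASS"` can run on every credential rotation.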

Stage the database TLS CA so Keycloak verifies the Postgres certificate rather than trusting blindly:

kubectl create configmap db-ca -n keycloak \
  --from-file=root.crt=./rds-combined-ca-bundle.pem

3. Create the TLS certificate for the server

cert-manager issues the cert the pods present. Reference your real ClusterIssuer (ACME/Let’s Encrypt, or your internal CA).

# keycloak-tls.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: keycloak-tls
  namespace: keycloak
spec:
  secretName: keycloak-tls
  duration: 2160h      # 90d
  renewBefore: 360h    # 15d
  dnsNames:
    - id.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

kubectl apply -f keycloak-tls.yaml
kubectl get certificate -n keycloak keycloak-tls -w   # wait for READY=True

4. Deploy the HA Keycloak CR

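This is the core object. It declares three replicas, points at the external database, loads the custom Infinispan configuration for clustering, and sets the hostname. The operator turns it into a StatefulSet, Services, and the cache configuration. Note that the manifest as shown does not reference the db-ca ConfigMap from step 2; wire it in through whichever truststore or pod-template mechanism your operator version supports before relying on certificate verification.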

# keycloak.yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: keycloak
spec:
  instances: 3
  image: quay.io/keycloak/keycloak:26.0.5

  db:
    vendor: postgres
    host: pg-keycloak.abc123.ap-south-1.rds.amazonaws.com
    port: 5432
    database: keycloak
    usernameSecret:
      name: keycloak-db
      key: username
    passwordSecret:
      name: keycloak-db
      key: password

  hostname:
    hostname: https://id.example.com

  http:
    httpEnabled: false
    tlsSecret: keycloak-tls      # must sit under spec.http, not directly under spec

  # Distributed session cache across pods -> no logout on pod loss
  cache:
    configMapFile:
      name: keycloak-cache
      key: cache-ispn-kubeping.xml

  resources:
    requests: { cpu: "1",   memory: 1500Mi }
    limits:   { cpu: "2",   memory: 2Gi }

  additionalOptions:
    - name: db-driver
      value: org.postgresql.Driver
    - name: cache-stack
      value: kubernetes            # JGroups KUBE_PING discovery
    - name: proxy-headers
      value: xforwarded            # behind ingress/Akamai
    - name: hostname-strict
      value: "true"

  # Spread pods across zones; one extra layer beyond default anti-affinity
  unsupported:
    podTemplate:
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: keycloak

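The cache section of the CR points at a custom Infinispan configuration that enables Kubernetes peer discovery and defines the distributed sessions and authenticationSessions caches. Create its ConfigMap before applying the CR: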

kubectl create configmap keycloak-cache -n keycloak \
  --from-file=cache-ispn-kubeping.xml=./cache-ispn-kubeping.xml

<!-- cache-ispn-kubeping.xml (abridged: the distributed caches + KUBE_PING) -->
<infinispan>
  <jgroups>
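    <!-- Extend the built-in tcp stack (not udp) since the transport is TCP,
         and swap its default MPING discovery for KUBE_PING. -->
    <stack name="kubernetes" extends="tcp">
      <TCP bind_port="7800"/>
      <org.jgroups.protocols.kubernetes.KUBE_PING
          namespace="keycloak"
          labels="app=keycloak"
          stack.combine="REPLACE" stack.position="MPING"/>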
    </stack>
  </jgroups>
  <cache-container name="keycloak">
    <transport stack="kubernetes"/>
    <distributed-cache name="sessions"               owners="2"/>
    <distributed-cache name="authenticationSessions" owners="2"/>
    <distributed-cache name="offlineSessions"        owners="2"/>
    <distributed-cache name="loginFailures"          owners="2"/>
  </cache-container>
</infinispan>

owners="2" keeps two copies of every cache entry on different pods, so losing any single pod does not lose a session; only losing both owners of an entry at the same moment would. Apply the CR and watch the operator build the StatefulSet:

kubectl apply -f keycloak.yaml
kubectl get keycloak keycloak -n keycloak -o jsonpath='{.status.conditions}' | jq
kubectl rollout status statefulset/keycloak -n keycloak --timeout=300s
kubectl get pods -n keycloak -l app=keycloak -o wide   # 3 pods, distinct zones/nodes
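
The -o wide output shows node names, and you still have to map nodes to zones by eye. A small sketch that does the mapping and counts distinct zones, assuming your nodes carry the standard topology.kubernetes.io/zone label; both function names are illustrative:

```shell
# count_distinct_zones: reads one zone name per line on stdin and prints
# how many different non-empty values appear.
count_distinct_zones() {
  sort -u | grep -c .
}

# pod_zones: prints the zone label of the node under each Keycloak pod,
# one per line. Uses the app=keycloak label from the commands above.
pod_zones() {
  kubectl get pods -n keycloak -l app=keycloak \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
  while read -r node; do
    kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
  done
}
```

A check such as `[ "$(pod_zones | count_distinct_zones)" -ge 3 ]` then fails loudly if two pods landed in the same zone.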

5. Expose it through the ingress

Front the operator-managed Service (keycloak-service:8443) with your ingress. Use backend re-encryption so traffic stays TLS end to end; Akamai terminates the public TLS and forwards to this origin.

# keycloak-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak
  namespace: keycloak
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"   # large OIDC headers
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ id.example.com ]
      secretName: keycloak-tls
  rules:
    - host: id.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak-service
                port: { number: 8443 }

kubectl apply -f keycloak-ingress.yaml
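
Before blaming the ingress for a 502, it helps to confirm the Service behind it actually has ready pods. A minimal sketch, assuming the operator-managed Service name used above; ready_backends is a hypothetical helper:

```shell
# ready_backends SERVICE NAMESPACE
# Prints how many ready pod IPs currently back the Service.
ready_backends() {
  kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w | tr -d ' '
}
```

With all three pods ready, `ready_backends keycloak-service keycloak` should print 3; a lower number means a pod is failing its readiness probe.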

6. Bootstrap a realm and broker upstream IdPs declaratively

Use KeycloakRealmImport so the realm, the downstream app clients, and the Entra ID / Okta identity-provider brokers are version-controlled, not hand-clicked. Keycloak federates up to corporate SSO and issues tokens down to the eleven apps and Moodle.

# realm-import.yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: KeycloakRealmImport
metadata:
  name: corp-realm-import
  namespace: keycloak
spec:
  keycloakCRName: keycloak
  realm:
    realm: corp
    enabled: true
    sslRequired: external
    identityProviders:
      - alias: entra-id
        providerId: oidc          # Entra ID as upstream workforce IdP
        enabled: true
        config:
          clientId: "<entra-app-id>"
          clientSecret: "<from-vault>"
          authorizationUrl: "https://login.microsoftonline.com/<tenant>/oauth2/v2.0/authorize"
          tokenUrl: "https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token"
          defaultScope: "openid profile email"
      - alias: okta
        providerId: oidc          # Okta as second workforce IdP
        enabled: true
        config:
          clientId: "<okta-client-id>"
          clientSecret: "<from-vault>"
          authorizationUrl: "https://example.okta.com/oauth2/v1/authorize"
          tokenUrl: "https://example.okta.com/oauth2/v1/token"
          defaultScope: "openid profile email"
    clients:
      - clientId: moodle
        protocol: openid-connect
        redirectUris: [ "https://learn.example.com/*" ]
        publicClient: false

kubectl apply -f realm-import.yaml
kubectl get keycloakrealmimport corp-realm-import -n keycloak \
  -o jsonpath='{.status.conditions}' | jq

The clientSecret values shown should be sourced from Vault (via the Vault/External Secrets operator) rather than committed — the placeholders are deliberate.
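
To keep that rule from depending on reviewer attention, a pre-commit or CI step can reject any realm file whose clientSecret is not a placeholder. This is only a sketch: find_literal_secrets is a hypothetical helper, and it assumes placeholders are written in the quoted angle-bracket form shown above.

```shell
# find_literal_secrets FILE
# Prints each clientSecret line (with its line number) whose value is NOT a
# quoted "<...>" placeholder. Exit status 0 means a literal value was found.
find_literal_secrets() {
  grep -n 'clientSecret:' "$1" |
    grep -v 'clientSecret:[[:space:]]*"<[^>]*>"'
}
```

In a hook: `if find_literal_secrets realm-import.yaml; then echo "literal clientSecret found" >&2; exit 1; fi`.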

7. Wire CI/CD, IaC, security, and observability

The cluster, the managed PostgreSQL, and the network are Terraform; node OS hardening and the Postgres parameter group are Ansible. The Keycloak CR, realm import, and ingress YAML deploy through Argo CD (GitOps — Argo watches the repo and reconciles the cluster to it), with the manifests built and validated in GitHub Actions (lint, kubeval, and a dry-run apply) before Argo syncs; on-prem clusters use Jenkins for the same gates. Wiz (with Wiz Code scanning the IaC in the pull request) runs CSPM over the cluster and the database to flag a publicly exposed Postgres or an over-broad security group; CrowdStrike Falcon sensors on the node pool give runtime threat detection feeding the SOC. Datadog (or Dynatrace) scrapes Keycloak’s /metrics endpoint and traces login latency, with ServiceNow receiving an auto-raised change request before a new realm or upstream IdP goes live and an incident ticket on any health-probe breach.

# Argo CD application pointing at the GitOps repo path
argocd app create keycloak \
  --repo https://git.example.com/platform/keycloak-gitops.git \
  --path manifests/keycloak --dest-namespace keycloak \
  --dest-server https://kubernetes.default.svc --sync-policy automated
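
The dry-run gate mentioned above could look roughly like this in a CI job. It is a sketch under assumptions: the runner has credentials for a cluster where the Keycloak CRDs are installed (server-side dry run needs both), and validate_manifests is an illustrative name.

```shell
# validate_manifests DIR
# Runs a server-side dry-run apply on every *.yaml file in DIR, reports each
# rejected file on stderr, and returns non-zero if any file was rejected.
validate_manifests() {
  failed=0
  for f in "$1"/*.yaml; do
    [ -e "$f" ] || continue            # directory had no YAML files
    if ! kubectl apply --dry-run=server -f "$f" >/dev/null; then
      echo "rejected: $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Running `validate_manifests manifests/keycloak` as the last step before merge keeps a malformed CR from ever reaching Argo.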

Validation

Prove HA before you trust it.

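# 1. Health endpoints on every pod return UP.
#    The stock Keycloak image ships without curl, so port-forward the
#    management port (9000) and probe from your workstation instead.
for p in $(kubectl get pods -n keycloak -l app=keycloak -o name); do
  kubectl port-forward -n keycloak "$p" 9000:9000 >/dev/null &
  pf=$!; sleep 2
  curl -sk https://localhost:9000/health/ready | jq -r .status
  kill "$pf"; wait "$pf" 2>/dev/null
done   # expect "UP" x3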

# 2. The cluster actually formed — JGroups view shows 3 members
kubectl logs -n keycloak keycloak-0 -c keycloak | grep -i "ISPN000094\|view"

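# 3. Database connectivity is live (no embedded DB). The database check belongs
#    to readiness, not liveness, and Keycloak documents it as requiring metrics.
kubectl port-forward -n keycloak pod/keycloak-0 9000:9000 >/dev/null &
pf=$!; sleep 2
curl -sk https://localhost:9000/health/ready | jq .
kill "$pf"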

# 4. THE session-survival test: log in, capture the cookie, kill the pod
#    serving you, then confirm the session still authenticates from another pod.
kubectl delete pod -n keycloak keycloak-0
#    Re-hit https://id.example.com with the same session cookie -> still logged in.
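
Check 2 above relies on reading the log by eye. A hedged sketch that pulls the member count out of the most recent cluster-view message, assuming the usual JGroups view rendering of "[coordinator|id] (N) [members]" on the ISPN000094 line; cluster_view_size is an illustrative helper:

```shell
# cluster_view_size: reads Keycloak log lines on stdin and prints the member
# count N from the last ISPN000094 "Received new cluster view" message.
# Prints nothing if no such line is present.
cluster_view_size() {
  grep 'ISPN000094' | tail -n 1 |
    sed -n 's/.*] (\([0-9][0-9]*\)) \[.*/\1/p'
}
```

Usage: `kubectl logs -n keycloak keycloak-0 -c keycloak | cluster_view_size` should print 3 once all pods have joined, and drop to 2 for a moment during the pod-kill test.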

Confirm the public OIDC discovery document resolves through the ingress:

curl -s https://id.example.com/realms/corp/.well-known/openid-configuration | jq .issuer
# "https://id.example.com/realms/corp"

Rollback / teardown

The CR is the unit of rollback. To revert a bad config, kubectl apply the previous keycloak.yaml (Argo does this automatically if you revert the git commit). For a full teardown — note the external database and its data are untouched, which is exactly why we externalized it:

kubectl delete keycloakrealmimport corp-realm-import -n keycloak
kubectl delete keycloak keycloak -n keycloak          # removes StatefulSet, Services
kubectl delete -f keycloak-ingress.yaml
kubectl delete configmap keycloak-cache db-ca -n keycloak
kubectl delete secret keycloak-db keycloak-initial-admin -n keycloak
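# Operator next, CRDs last (deleting a CRD deletes every CR of that kind):
kubectl delete -n keycloak -f \
  https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/26.0.5/kubernetes/kubernetes.yml
kubectl delete crd keycloaks.k8s.keycloak.org keycloakrealmimports.k8s.keycloak.org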

Because all durable state is in PostgreSQL, you can delete every pod and recreate the CR to land back exactly where you were — the realm, users, and clients are intact. Take a pg_dump before any major version bump regardless.
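
For the pre-upgrade dump, something along these lines is usually enough. It is a sketch, not a backup strategy: backup_keycloak_db is a hypothetical helper, the database name matches the CR above, and the password is expected in PGPASSWORD or ~/.pgpass the way pg_dump normally reads it.

```shell
# backup_keycloak_db HOST USER
# Writes a custom-format dump named keycloak-YYYY-MM-DD.dump in the current
# directory and prints the file name on success.
backup_keycloak_db() {
  out="keycloak-$(date +%F).dump"
  pg_dump --host="$1" --port=5432 --username="$2" --dbname=keycloak \
    --format=custom --file="$out" && echo "$out"
}
```

The custom format is chosen so the dump can be restored selectively with pg_restore if an upgrade migration goes wrong.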

Common pitfalls

Security notes

Disable the temporary bootstrap admin immediately after creating a real administrative account federated through Entra ID — leaving tmpadmin live is the classic Keycloak foothold. Keep httpEnabled: false and sslRequired: external so no plaintext listener exists. All secrets (DB credentials, upstream IdP client secrets, the bootstrap password) are issued by HashiCorp Vault and never live in git; the database credential is a short-lived dynamic lease that Vault rotates. Wiz Code gates the IaC pull request so a publicly exposed Postgres or an open security group is caught before merge, Wiz CSPM verifies it stays that way at runtime, and CrowdStrike Falcon watches the node pool for runtime compromise. Put Akamai’s WAF and bot mitigation in front to absorb the credential-stuffing traffic an internet-facing IdP attracts.

Cost notes

The dominant cost is the managed PostgreSQL, not the pods: size it for the connection load (see Common pitfalls) and use a single multi-AZ instance rather than over-provisioning read replicas that Keycloak does not use for its primary path. The three Keycloak pods are modest (1 vCPU / 1500Mi requested each); resist scaling pod count for availability you already get from Infinispan replication, and scale on actual login throughput, validated in Datadog. Externalizing the database means you pay for it independently, but it is the line item that buys the independent-failure property the whole project exists for. Reserve or commit-discount both the database and the node pool, and let ServiceNow change records tie each scale-up to an approved capacity request so growth stays accountable.
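
Sizing for connection load comes down to simple arithmetic: in the worst case every pod fills its JDBC pool at once. A rough sketch, assuming a pool ceiling of 100 per pod (check what db-pool-max-size resolves to in your deployment) and 20 connections held back for admins, monitoring, and migrations; both figures and the function name are illustrative:

```shell
# db_connection_budget PODS POOL_MAX RESERVED
# Prints the max_connections the database must allow in the worst case:
# every pod at its pool ceiling, plus connections reserved for other clients.
db_connection_budget() {
  echo $(( $1 * $2 + $3 ))
}

db_connection_budget 3 100 20   # prints 320
```

If the instance class you were planning on cannot offer that many connections, lower the pool ceiling before buying a bigger database.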


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
