Deploy Confluent Platform for Apache Kafka on Kubernetes with the Confluent Operator

A retail-logistics company is drowning in point-to-point integrations: the warehouse management system pokes the order service over a brittle REST call, the fraud team scrapes a read replica every five minutes, and the analytics team’s nightly batch is always six hours stale by the time anyone looks at it. The platform team’s mandate is to put a real event backbone in the middle — every parcel scan, every order state change, every inventory delta published once to Apache Kafka and consumed by anyone who needs it, in order, durably, with a schema contract that stops a producer from silently breaking ten downstream consumers. They already run everything on Kubernetes, so a managed cloud Kafka is off the table for the regulated on-prem workloads; they need Kafka in their own cluster, operated like a first-class platform service rather than a pet. This guide walks through deploying that backbone with Confluent for Kubernetes (CFK) — the official Confluent operator — running a three-broker (KRaft) Kafka cluster, Schema Registry, and Kafka Connect, all secured with TLS and RBAC, and slotted into the enterprise’s existing identity, secrets, GitOps, and observability tooling.

Prerequisites

A Kubernetes cluster, v1.27+, with at least three worker nodes spread across availability zones (so the three KRaft controllers and three brokers land in different failure domains). 16 GiB RAM and 4 vCPU per node is a sane floor for a non-toy cluster.
A default StorageClass that provisions block storage with volumeBindingMode: WaitForFirstConsumer and allowVolumeExpansion: true (e.g. gp3 on EKS, managed-csi on AKS, pd-ssd on GKE). Kafka wants low-latency persistent disks, not NFS.
kubectl v1.27+, helm v3.12+, and the confluent CLI v3+ on your workstation.
A Confluent Platform license (a 30-day trial license ships built in; production needs a license secret). Confluent Platform 7.7 / CFK 2.9 is assumed below.
A CA you can issue certificates from — here, HashiCorp Vault’s PKI secrets engine acts as the internal CA so every Kafka listener gets a short-lived, auto-renewed TLS cert instead of a hand-rolled self-signed chain.
An OIDC identity provider for human and CI access — Microsoft Entra ID is the workforce IdP, federated from Okta as the upstream SSO so engineers log into Kafka tooling with the same corporate identity they use everywhere else.
Cluster-admin on the target namespace, and a Git repository the platform team controls (the operator and all CRs are delivered through Argo CD, never kubectl apply by hand in production).

Target topology

Deploy Confluent Platform for Apache Kafka on Kubernetes with the Confluent Operator — topology

Everything lives in a single namespace, confluent. CFK — the operator — runs as a Deployment and watches a set of custom resources (Kafka, KRaftController, SchemaRegistry, Connect, KafkaTopic, KafkaRestClass, ConfluentRolebinding). You declare what you want as YAML; the operator reconciles brokers, controllers, StatefulSets, Services, and certificates to match. The three KRaft controllers form the metadata quorum (replacing the old ZooKeeper ensemble), the three brokers form the data plane, Schema Registry enforces schema contracts on the wire, and Kafka Connect runs source/sink connectors. Inter-component traffic is mTLS; client traffic is TLS with RBAC. Producers and consumers reach the cluster through a bootstrap Service; external systems reach it through a dedicated external listener.

The supporting cast around the cluster is what turns it from a demo into a platform:

HashiCorp Vault is the internal CA (PKI engine) and the store for the Connect connector credentials and the license secret — pulled in by the Vault Agent injector so nothing sensitive sits in a plain Kubernetes Secret.
Microsoft Entra ID (federated from Okta) is the OIDC provider behind RBAC, so a human’s group membership maps to Kafka principals and role bindings.
Argo CD delivers the operator and every custom resource via GitOps; GitHub Actions runs the schema-compatibility and config-lint checks before a change is allowed to merge.
Terraform provisions the cluster, node pools, StorageClasses, and Vault PKI mount; Ansible handles any node-level tuning (file-descriptor limits, vm.max_map_count) the managed node image does not already set.
Dynatrace (or Datadog) scrapes the JMX/Prometheus metrics and the OneAgent gives per-broker and per-topic telemetry; CrowdStrike Falcon runs as a node sensor for runtime threat detection on the Kafka nodes.
Wiz (and Wiz Code in the pipeline) does posture and IaC scanning so a broker listener never drifts to plaintext or a public LoadBalancer without an alert.
ServiceNow is the change gate — a new external listener or an RBAC role binding for a third party goes through a change request before Argo CD is allowed to sync it.
Akamai sits at the edge only for the operator-adjacent HTTP surfaces (the Control Center UI, the Schema Registry REST endpoint when exposed to partners), terminating TLS and providing WAF in front of those.

1. Provision the cluster prerequisites with Terraform

Stand up (or reuse) the Kubernetes cluster, a fast StorageClass, and the Vault PKI mount with Terraform so the substrate is reproducible. The Kafka-specific pieces are the StorageClass and the Vault PKI role; the cluster itself is whatever your platform already uses.

# storageclass.tf — fast block storage, expandable, late-bound to the AZ the pod lands in
resource "kubernetes_storage_class" "kafka_fast" {
  metadata { name = "kafka-fast" }
  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Retain"          # never auto-delete broker data
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = true
  parameters = {
    type       = "gp3"
    iops       = "6000"
    throughput = "250"
    encrypted  = "true"
  }
}

# vault-pki.tf — internal CA that issues short-lived Kafka listener certs
resource "vault_mount" "pki_kafka" {
  path        = "pki_kafka"
  type        = "pki"
  max_lease_ttl_seconds = 7776000            # 90d issuing-CA lifetime
}

resource "vault_pki_secret_backend_role" "kafka" {
  backend          = vault_mount.pki_kafka.path
  name             = "kafka-internal"
  allowed_domains  = ["confluent.svc.cluster.local", "kafka.internal.acme.com"]
  allow_subdomains = true
  max_ttl          = "720h"                  # 30d leaf certs, auto-renewed
  key_type         = "rsa"
  key_bits         = 2048
}

terraform init && terraform apply -auto-approve
kubectl get storageclass kafka-fast   # confirm it exists and is the intended class

Run wiz-code iac scan . in the pipeline at this point — Wiz Code flags an unencrypted volume, a reclaim_policy of Delete on stateful storage, or a public endpoint before any of it reaches the cluster.

2. Install the Confluent for Kubernetes operator

Add the Confluent Helm repo and install CFK into the confluent namespace. The operator is cluster-scoped here so it can manage future namespaces, but the workload runs in confluent.

kubectl create namespace confluent

helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

helm upgrade --install confluent-operator \
  confluentinc/confluent-for-kubernetes \
  --namespace confluent \
  --set namespaced=false \
  --version 0.1149.x          # CFK chart matching CP 7.7

kubectl -n confluent get pods -l app.kubernetes.io/name=confluent-operator

In production you do not run that helm command by hand. The Helm release is declared as an Argo CD Application, and Argo CD reconciles it from Git. The block below is the GitOps source of truth:

# argocd/confluent-operator.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: confluent-operator
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://packages.confluent.io/helm
    chart: confluent-for-kubernetes
    targetRevision: 0.1149.x
    helm:
      parameters:
        - { name: namespaced, value: "false" }
  destination:
    server: https://kubernetes.default.svc
    namespace: confluent
  syncPolicy:
    automated: { prune: true, selfHeal: true }

3. Wire TLS certificates from Vault

CFK supports auto-generated certs, but here Vault is the CA so leaf certs are short-lived and centrally revocable. Issue the CA bundle and a server keypair from the Vault PKI role provisioned in step 1, and load them as the cluster’s certificate authority. The confluent CLI’s helper generates the right Secret shape, or use Vault directly:

# Pull the CA chain Vault will sign with
vault read -field=certificate pki_kafka/cert/ca > cacerts.pem

# Create the CA secret CFK uses to auto-generate per-component server certs
kubectl -n confluent create secret generic ca-pair-sslcerts \
  --from-file=ca.crt=cacerts.pem \
  --from-file=ca.key=ca.key      # the issuing CA key, injected via Vault Agent, not on disk

For dynamic per-broker leaf certs, annotate the namespace so the Vault Agent injector mounts a freshly issued cert into each broker pod, and point CFK at that path. The practical payoff: when a cert is 30 days from expiry Vault re-issues it and the agent rotates it in place, so nobody is paged at 2 a.m. for an expired Kafka listener.

4. Deploy the KRaft controllers and the Kafka brokers

This is the core. One KRaftController CR (three replicas) forms the metadata quorum; one Kafka CR (three replicas) is the data plane. Note the explicit TLS config, the storage size bound to the kafka-fast class, anti-affinity across zones, and the resource requests.

# kafka-cluster.yaml
apiVersion: platform.confluent.io/v1beta1
kind: KRaftController
metadata:
  name: kraftcontroller
  namespace: confluent
spec:
  replicas: 3
  image:
    application: confluentinc/cp-server:7.7.0
    init: confluentinc/confluent-init-container:2.9.0
  dataVolumeCapacity: 10Gi
  storageClass: { name: kafka-fast }
  tls:
    secretRef: ca-pair-sslcerts        # signed by Vault PKI
---
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
  name: kafka
  namespace: confluent
spec:
  replicas: 3
  image:
    application: confluentinc/cp-server:7.7.0
    init: confluentinc/confluent-init-container:2.9.0
  dataVolumeCapacity: 100Gi
  storageClass: { name: kafka-fast }
  dependencies:
    kRaftController:
      controllerListener:
        tls: { enabled: true }
      clusterRef: { name: kraftcontroller }
  podTemplate:
    resources:
      requests: { cpu: "2", memory: 8Gi }
      limits:   { memory: 8Gi }
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels: { app: kafka }
  listeners:
    internal:
      authentication: { type: mtls }
      tls: { enabled: true }
    external:
      externalAccess:
        type: loadBalancer
        loadBalancer:
          domain: kafka.internal.acme.com
          bootstrapPrefix: bootstrap
      authentication: { type: mtls }
      tls: { enabled: true }
  configOverrides:
    server:
      - "min.insync.replicas=2"          # with RF=3, tolerate one broker loss
      - "default.replication.factor=3"
      - "auto.create.topics.enable=false"  # topics are declared, never accidental

Apply via Git (kubectl apply -f kafka-cluster.yaml only for a throwaway dev cluster; otherwise commit it and let Argo CD sync):

kubectl apply -f kafka-cluster.yaml
kubectl -n confluent rollout status statefulset/kafka --timeout=600s
kubectl -n confluent get kafka kafka -o jsonpath='{.status.phase}'   # want: RUNNING

The min.insync.replicas=2 with default.replication.factor=3 is the durability contract: a produce with acks=all only succeeds when the leader plus one follower have the record, so a single broker failure never loses an acknowledged write — and auto.create.topics.enable=false means a typo’d topic name fails loudly instead of silently spawning an unmanaged topic.

5. Enable RBAC backed by Entra ID

Kafka RBAC binds principals to roles on resources. The principals come from OIDC — engineers authenticate through Entra ID (federated upstream from Okta), CI/connectors use mTLS principals. CFK configures the Metadata Service (MDS) that issues and validates RBAC tokens. Enable it on the Kafka CR and declare role bindings as ConfluentRolebinding CRs:

# rbac.yaml
apiVersion: platform.confluent.io/v1beta1
kind: ConfluentRolebinding
metadata:
  name: orders-team-developerwrite
  namespace: confluent
spec:
  principal:
    type: group
    name: "kafka-orders-producers"     # an Entra ID security group, surfaced via OIDC
  role: DeveloperWrite
  resourcePatterns:
    - { resourceType: Topic, name: "orders.", patternType: PREFIXED }
---
apiVersion: platform.confluent.io/v1beta1
kind: ConfluentRolebinding
metadata:
  name: analytics-readonly
  namespace: confluent
spec:
  principal: { type: group, name: "kafka-analytics-consumers" }
  role: DeveloperRead
  resourcePatterns:
    - { resourceType: Topic, name: "orders.", patternType: PREFIXED }
    - { resourceType: Group, name: "analytics-", patternType: PREFIXED }

The win here is least privilege by default: the orders team can write to orders.* and nothing else, the analytics team can only read, and because the principal is an Entra ID group, access is granted and revoked by HR/IT group membership — a leaver loses Kafka access the moment they leave the directory group, with no Kafka-side change. Every new third-party role binding routes through a ServiceNow change request before Argo CD is permitted to sync it.

6. Deploy Schema Registry

Schema Registry enforces the schema contract so a producer cannot ship a breaking change that silently corrupts every consumer. It runs as its own CR, depends on Kafka over TLS, and is set to reject incompatible schemas.

# schema-registry.yaml
apiVersion: platform.confluent.io/v1beta1
kind: SchemaRegistry
metadata:
  name: schemaregistry
  namespace: confluent
spec:
  replicas: 2
  image: { application: confluentinc/cp-schema-registry:7.7.0, init: confluentinc/confluent-init-container:2.9.0 }
  dependencies:
    kafka:
      bootstrapEndpoint: kafka.confluent.svc.cluster.local:9071
      authentication: { type: mtls }
      tls: { enabled: true }
  tls: { secretRef: ca-pair-sslcerts }

Set the global compatibility level to BACKWARD (new schema can read old data) so consumers never break, and enforce it in CI — GitHub Actions runs schema-registry-maven-plugin:test-compatibility against the live registry on every PR, so a breaking Avro change fails the build, not production:

kubectl apply -f schema-registry.yaml
kubectl -n confluent rollout status statefulset/schemaregistry --timeout=300s

# Pin global compatibility to BACKWARD via the REST API (through the in-cluster service)
curl -s --cacert cacerts.pem \
  -X PUT https://schemaregistry.confluent.svc.cluster.local:8081/config \
  -H "Content-Type: application/json" \
  -d '{"compatibility": "BACKWARD"}'

7. Deploy Kafka Connect

Kafka Connect runs source and sink connectors — here, a sink that streams orders.* into the analytics warehouse. The connector’s credentials come from Vault (injected, never inline). Deploy the Connect cluster, then register a connector with a Connector CR.

# connect.yaml
apiVersion: platform.confluent.io/v1beta1
kind: Connect
metadata:
  name: connect
  namespace: confluent
spec:
  replicas: 2
  image: { application: confluentinc/cp-server-connect:7.7.0, init: confluentinc/confluent-init-container:2.9.0 }
  dependencies:
    kafka:
      bootstrapEndpoint: kafka.confluent.svc.cluster.local:9071
      authentication: { type: mtls }
      tls: { enabled: true }
  podTemplate:
    podSecurityContext: { fsGroup: 1000, runAsUser: 1000 }
---
apiVersion: platform.confluent.io/v1beta1
kind: Connector
metadata:
  name: orders-to-warehouse
  namespace: confluent
spec:
  class: "io.confluent.connect.jdbc.JdbcSinkConnector"
  taskMax: 3
  connectClusterRef: { name: connect }
  configs:
    topics: "orders.created,orders.shipped"
    connection.url: "${vault:secret/data/connect/warehouse#jdbc_url}"   # resolved by Vault
    insert.mode: "upsert"
    pk.mode: "record_key"

kubectl apply -f connect.yaml
kubectl -n confluent rollout status statefulset/connect --timeout=300s
kubectl -n confluent get connector orders-to-warehouse -o jsonpath='{.status.connectorState}'  # RUNNING

8. Hand the cluster to observability and runtime security

Point Dynatrace (OneAgent on the node pool, plus the JMX/Prometheus extension) or Datadog (the Datadog Agent with the Kafka and Kafka Connect integrations) at the brokers so you get per-broker request latency, under-replicated partition counts, consumer-group lag, and disk-usage trends — the four metrics that predict a Kafka incident before it happens. Deploy CrowdStrike Falcon as a node-level DaemonSet sensor so runtime threats on the Kafka nodes feed the SOC, and let Wiz run continuous posture scanning so an accidental plaintext listener or a public LoadBalancer drift raises an alert immediately.

# Expose broker JMX as Prometheus for the agent to scrape (CFK ships the exporter)
kubectl -n confluent get svc kafka-0-internal -o yaml | grep -A2 metrics
# Confirm Datadog/Dynatrace is reading consumer lag:
kubectl -n confluent exec kafka-0 -- kafka-consumer-groups \
  --bootstrap-server localhost:9071 --describe --group analytics-warehouse \
  --command-config /mnt/sslcerts/client.properties

Validation

Walk the data path end to end before declaring success. Use the in-cluster Kafka tooling with a client config that presents an mTLS cert.

# 1. Cluster + components are RUNNING
kubectl -n confluent get kafka,kraftcontroller,schemaregistry,connect

# 2. Create a managed topic (declarative, not auto-created) and confirm RF=3
kubectl apply -f - <<'EOF'
apiVersion: platform.confluent.io/v1beta1
kind: KafkaTopic
metadata: { name: orders.created, namespace: confluent }
spec:
  replicas: 3
  partitions: 12
  configs: { "min.insync.replicas": "2" }
EOF
kubectl -n confluent exec kafka-0 -- kafka-topics \
  --bootstrap-server localhost:9071 --describe --topic orders.created \
  --command-config /mnt/sslcerts/client.properties

# 3. Produce and consume a record over TLS
kubectl -n confluent exec -it kafka-0 -- bash -c \
  'echo "{\"orderId\":\"A-1\"}" | kafka-console-producer \
    --bootstrap-server localhost:9071 --topic orders.created \
    --producer.config /mnt/sslcerts/client.properties'

kubectl -n confluent exec kafka-0 -- kafka-console-consumer \
  --bootstrap-server localhost:9071 --topic orders.created --from-beginning --max-messages 1 \
  --consumer.config /mnt/sslcerts/client.properties

# 4. Register a schema and confirm BACKWARD compatibility is enforced
curl -s --cacert cacerts.pem https://schemaregistry.confluent.svc.cluster.local:8081/config

# 5. Verify RBAC denies an unauthorized principal (should return AuthorizationException)
kafka-topics --bootstrap-server kafka.internal.acme.com:9092 --list \
  --command-config /tmp/unauthorized-client.properties

A green run is: all four CRs RUNNING, orders.created showing ReplicationFactor: 3 and Isr: 3, a record round-tripping, the registry returning BACKWARD, and the unauthorized client being denied.

Rollback and teardown

Because everything is declarative, rollback is a Git revert that Argo CD reconciles. To tear a cluster down by hand, delete the workload CRs first (so the operator drains and removes StatefulSets cleanly), then the operator, then — deliberately and last — the PersistentVolumeClaims, because the Retain reclaim policy keeps broker data even after the PVCs are gone.

# 1. Remove workloads (operator drains brokers gracefully)
kubectl -n confluent delete connector --all
kubectl -n confluent delete connect connect
kubectl -n confluent delete schemaregistry schemaregistry
kubectl -n confluent delete kafka kafka
kubectl -n confluent delete kraftcontroller kraftcontroller

# 2. Remove the operator
helm -n confluent uninstall confluent-operator

# 3. DELIBERATELY remove data last (irreversible)
kubectl -n confluent delete pvc -l app=kafka
kubectl -n confluent delete pvc -l app=kraftcontroller

For a rollback rather than a teardown, git revert the offending commit; Argo CD’s selfHeal re-applies the previous known-good CR set. Never edit a live CR with kubectl edit in production — the next Argo sync will overwrite it and you will have lost the change.

Common pitfalls

Using volumeBindingMode: Immediate storage. The PVC binds to a zone before the pod is scheduled, so the broker pod and its disk can land in different zones and the pod never starts. Always use WaitForFirstConsumer for the broker StorageClass.
Setting reclaim_policy: Delete on broker storage. A namespace delete or an over-eager Argo prune then wipes Kafka data permanently. Use Retain and delete PVCs manually as a separate, deliberate step.
Leaving auto.create.topics.enable=true. A consumer subscribing to a misspelled topic silently creates an unmanaged, single-partition, RF-1 topic that quietly drops data on a broker loss. Disable it and declare topics with KafkaTopic CRs.
min.insync.replicas equal to the replication factor. With RF=3 and min.insync.replicas=3, a single broker reboot blocks all acks=all produces. Keep it at 2 so you can tolerate one broker down.
Forgetting pod anti-affinity. Without it, the scheduler can stack all three brokers on one node, and a single node failure takes the whole cluster offline. Pin anti-affinity to topology.kubernetes.io/zone.
Self-signed certs with no rotation. A hand-rolled cert expires unnoticed and the entire cluster’s TLS listeners go dark at once. Letting Vault PKI issue and the agent rotate removes that whole class of 2 a.m. page.

Security notes

The cluster is TLS-everywhere by construction: mTLS on the internal and controller listeners, TLS on the external listener, all signed by Vault’s PKI engine so leaf certs are short-lived and revocable. RBAC binds Entra ID groups (federated from Okta) to least-privilege roles, so access follows directory membership and a leaver is de-authorized automatically. Connector secrets and the license live in Vault, injected at runtime, never in a plain Kubernetes Secret or a Git-committed config. Wiz continuously checks posture so a plaintext-listener or public-LoadBalancer drift alarms immediately, and Wiz Code blocks the same misconfigurations in IaC before merge. CrowdStrike Falcon node sensors feed runtime detections to the SOC, and ServiceNow is the documented change gate for any new external listener or third-party role binding. Where a Schema Registry or Control Center surface is exposed to partners, Akamai terminates TLS and provides WAF at the edge.

Cost notes

The dominant cost is persistent disk and the always-on broker compute, not the Kafka software. Size dataVolumeCapacity to real retention — set per-topic retention.ms aggressively (orders events rarely need 7 days) so you are not paying to store data nobody reads, and use gp3 with provisioned IOPS rather than the pricier io2 unless a topic genuinely needs it. Three brokers and three KRaft controllers is the floor for production durability; do not over-provision replicas chasing headroom you can add later with allowVolumeExpansion and broker scale-out. Right-size the podTemplate resource requests to observed usage in Dynatrace/Datadog rather than guessing high, and let the Connect cluster scale taskMax to load instead of running idle workers. Topic compaction (cleanup.policy=compact) on changelog-style topics keeps only the latest value per key and can cut storage for those topics by an order of magnitude.