Data Platform

Deploy Confluent Platform for Apache Kafka on Kubernetes with the Confluent Operator

A retail-logistics company is drowning in point-to-point integrations: the warehouse management system pokes the order service over a brittle REST call, the fraud team scrapes a read replica every five minutes, and the analytics team’s nightly batch is always six hours stale by the time anyone looks at it. The platform team’s mandate is to put a real event backbone in the middle — every parcel scan, every order state change, every inventory delta published once to Apache Kafka and consumed by anyone who needs it, in order, durably, with a schema contract that stops a producer from silently breaking ten downstream consumers. They already run everything on Kubernetes, so a managed cloud Kafka is off the table for the regulated on-prem workloads; they need Kafka in their own cluster, operated like a first-class platform service rather than a pet. This guide walks through deploying that backbone with Confluent for Kubernetes (CFK) — the official Confluent operator — running a three-broker (KRaft) Kafka cluster, Schema Registry, and Kafka Connect, all secured with TLS and RBAC, and slotted into the enterprise’s existing identity, secrets, GitOps, and observability tooling.

Prerequisites

Target topology

Deploy Confluent Platform for Apache Kafka on Kubernetes with the Confluent Operator — topology

Everything lives in a single namespace, confluent. CFK — the operator — runs as a Deployment and watches a set of custom resources (Kafka, KRaftController, SchemaRegistry, Connect, KafkaTopic, KafkaRestClass, ConfluentRolebinding). You declare what you want as YAML; the operator reconciles brokers, controllers, StatefulSets, Services, and certificates to match. The three KRaft controllers form the metadata quorum (replacing the old ZooKeeper ensemble), the three brokers form the data plane, Schema Registry enforces schema contracts on the wire, and Kafka Connect runs source/sink connectors. Inter-component traffic is mTLS; client traffic is TLS with RBAC. Producers and consumers reach the cluster through a bootstrap Service; external systems reach it through a dedicated external listener.

The supporting cast around the cluster is what turns it from a demo into a platform:

1. Provision the cluster prerequisites with Terraform

Stand up (or reuse) the Kubernetes cluster, a fast StorageClass, and the Vault PKI mount with Terraform so the substrate is reproducible. The Kafka-specific pieces are the StorageClass and the Vault PKI role; the cluster itself is whatever your platform already uses.

# storageclass.tf — fast block storage, expandable, late-bound to the AZ the pod lands in
resource "kubernetes_storage_class" "kafka_fast" {
  metadata { name = "kafka-fast" }
  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Retain"          # never auto-delete broker data
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = true
  parameters = {
    type       = "gp3"
    iops       = "6000"
    throughput = "250"
    encrypted  = "true"
  }
}

# vault-pki.tf — internal CA that issues short-lived Kafka listener certs
resource "vault_mount" "pki_kafka" {
  path        = "pki_kafka"
  type        = "pki"
  max_lease_ttl_seconds = 7776000            # 90d issuing-CA lifetime
}

resource "vault_pki_secret_backend_role" "kafka" {
  backend          = vault_mount.pki_kafka.path
  name             = "kafka-internal"
  allowed_domains  = ["confluent.svc.cluster.local", "kafka.internal.acme.com"]
  allow_subdomains = true
  max_ttl          = "720h"                  # 30d leaf certs, auto-renewed
  key_type         = "rsa"
  key_bits         = 2048
}
terraform init && terraform apply -auto-approve
kubectl get storageclass kafka-fast   # confirm it exists and is the intended class

Run wiz-code iac scan . in the pipeline at this point — Wiz Code flags an unencrypted volume, a reclaim_policy of Delete on stateful storage, or a public endpoint before any of it reaches the cluster.

2. Install the Confluent for Kubernetes operator

Add the Confluent Helm repo and install CFK into the confluent namespace. The operator is cluster-scoped here so it can manage future namespaces, but the workload runs in confluent.

kubectl create namespace confluent

helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

helm upgrade --install confluent-operator \
  confluentinc/confluent-for-kubernetes \
  --namespace confluent \
  --set namespaced=false \
  --version 0.1149.x          # CFK chart matching CP 7.7

kubectl -n confluent get pods -l app.kubernetes.io/name=confluent-operator

In production you do not run that helm command by hand. The Helm release is declared as an Argo CD Application, and Argo CD reconciles it from Git. The block below is the GitOps source of truth:

# argocd/confluent-operator.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: confluent-operator
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://packages.confluent.io/helm
    chart: confluent-for-kubernetes
    targetRevision: 0.1149.x
    helm:
      parameters:
        - { name: namespaced, value: "false" }
  destination:
    server: https://kubernetes.default.svc
    namespace: confluent
  syncPolicy:
    automated: { prune: true, selfHeal: true }

3. Wire TLS certificates from Vault

CFK supports auto-generated certs, but here Vault is the CA so leaf certs are short-lived and centrally revocable. Issue the CA bundle and a server keypair from the Vault PKI role provisioned in step 1, and load them as the cluster’s certificate authority. The confluent CLI’s helper generates the right Secret shape, or use Vault directly:

# Pull the CA chain Vault will sign with
vault read -field=certificate pki_kafka/cert/ca > cacerts.pem

# Create the CA secret CFK uses to auto-generate per-component server certs
kubectl -n confluent create secret generic ca-pair-sslcerts \
  --from-file=ca.crt=cacerts.pem \
  --from-file=ca.key=ca.key      # the issuing CA key, injected via Vault Agent, not on disk

For dynamic per-broker leaf certs, annotate the namespace so the Vault Agent injector mounts a freshly issued cert into each broker pod, and point CFK at that path. The practical payoff: when a cert is 30 days from expiry Vault re-issues it and the agent rotates it in place, so nobody is paged at 2 a.m. for an expired Kafka listener.

4. Deploy the KRaft controllers and the Kafka brokers

This is the core. One KRaftController CR (three replicas) forms the metadata quorum; one Kafka CR (three replicas) is the data plane. Note the explicit TLS config, the storage size bound to the kafka-fast class, anti-affinity across zones, and the resource requests.

# kafka-cluster.yaml
apiVersion: platform.confluent.io/v1beta1
kind: KRaftController
metadata:
  name: kraftcontroller
  namespace: confluent
spec:
  replicas: 3
  image:
    application: confluentinc/cp-server:7.7.0
    init: confluentinc/confluent-init-container:2.9.0
  dataVolumeCapacity: 10Gi
  storageClass: { name: kafka-fast }
  tls:
    secretRef: ca-pair-sslcerts        # signed by Vault PKI
---
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
  name: kafka
  namespace: confluent
spec:
  replicas: 3
  image:
    application: confluentinc/cp-server:7.7.0
    init: confluentinc/confluent-init-container:2.9.0
  dataVolumeCapacity: 100Gi
  storageClass: { name: kafka-fast }
  dependencies:
    kRaftController:
      controllerListener:
        tls: { enabled: true }
      clusterRef: { name: kraftcontroller }
  podTemplate:
    resources:
      requests: { cpu: "2", memory: 8Gi }
      limits:   { memory: 8Gi }
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels: { app: kafka }
  listeners:
    internal:
      authentication: { type: mtls }
      tls: { enabled: true }
    external:
      externalAccess:
        type: loadBalancer
        loadBalancer:
          domain: kafka.internal.acme.com
          bootstrapPrefix: bootstrap
      authentication: { type: mtls }
      tls: { enabled: true }
  configOverrides:
    server:
      - "min.insync.replicas=2"          # with RF=3, tolerate one broker loss
      - "default.replication.factor=3"
      - "auto.create.topics.enable=false"  # topics are declared, never accidental

Apply via Git (kubectl apply -f kafka-cluster.yaml only for a throwaway dev cluster; otherwise commit it and let Argo CD sync):

kubectl apply -f kafka-cluster.yaml
kubectl -n confluent rollout status statefulset/kafka --timeout=600s
kubectl -n confluent get kafka kafka -o jsonpath='{.status.phase}'   # want: RUNNING

The min.insync.replicas=2 with default.replication.factor=3 is the durability contract: a produce with acks=all only succeeds when the leader plus one follower have the record, so a single broker failure never loses an acknowledged write — and auto.create.topics.enable=false means a typo’d topic name fails loudly instead of silently spawning an unmanaged topic.

5. Enable RBAC backed by Entra ID

Kafka RBAC binds principals to roles on resources. The principals come from OIDC — engineers authenticate through Entra ID (federated upstream from Okta), CI/connectors use mTLS principals. CFK configures the Metadata Service (MDS) that issues and validates RBAC tokens. Enable it on the Kafka CR and declare role bindings as ConfluentRolebinding CRs:

# rbac.yaml
apiVersion: platform.confluent.io/v1beta1
kind: ConfluentRolebinding
metadata:
  name: orders-team-developerwrite
  namespace: confluent
spec:
  principal:
    type: group
    name: "kafka-orders-producers"     # an Entra ID security group, surfaced via OIDC
  role: DeveloperWrite
  resourcePatterns:
    - { resourceType: Topic, name: "orders.", patternType: PREFIXED }
---
apiVersion: platform.confluent.io/v1beta1
kind: ConfluentRolebinding
metadata:
  name: analytics-readonly
  namespace: confluent
spec:
  principal: { type: group, name: "kafka-analytics-consumers" }
  role: DeveloperRead
  resourcePatterns:
    - { resourceType: Topic, name: "orders.", patternType: PREFIXED }
    - { resourceType: Group, name: "analytics-", patternType: PREFIXED }

The win here is least privilege by default: the orders team can write to orders.* and nothing else, the analytics team can only read, and because the principal is an Entra ID group, access is granted and revoked by HR/IT group membership — a leaver loses Kafka access the moment they leave the directory group, with no Kafka-side change. Every new third-party role binding routes through a ServiceNow change request before Argo CD is permitted to sync it.

6. Deploy Schema Registry

Schema Registry enforces the schema contract so a producer cannot ship a breaking change that silently corrupts every consumer. It runs as its own CR, depends on Kafka over TLS, and is set to reject incompatible schemas.

# schema-registry.yaml
apiVersion: platform.confluent.io/v1beta1
kind: SchemaRegistry
metadata:
  name: schemaregistry
  namespace: confluent
spec:
  replicas: 2
  image: { application: confluentinc/cp-schema-registry:7.7.0, init: confluentinc/confluent-init-container:2.9.0 }
  dependencies:
    kafka:
      bootstrapEndpoint: kafka.confluent.svc.cluster.local:9071
      authentication: { type: mtls }
      tls: { enabled: true }
  tls: { secretRef: ca-pair-sslcerts }

Set the global compatibility level to BACKWARD (new schema can read old data) so consumers never break, and enforce it in CI — GitHub Actions runs schema-registry-maven-plugin:test-compatibility against the live registry on every PR, so a breaking Avro change fails the build, not production:

kubectl apply -f schema-registry.yaml
kubectl -n confluent rollout status statefulset/schemaregistry --timeout=300s

# Pin global compatibility to BACKWARD via the REST API (through the in-cluster service)
curl -s --cacert cacerts.pem \
  -X PUT https://schemaregistry.confluent.svc.cluster.local:8081/config \
  -H "Content-Type: application/json" \
  -d '{"compatibility": "BACKWARD"}'

7. Deploy Kafka Connect

Kafka Connect runs source and sink connectors — here, a sink that streams orders.* into the analytics warehouse. The connector’s credentials come from Vault (injected, never inline). Deploy the Connect cluster, then register a connector with a Connector CR.

# connect.yaml
apiVersion: platform.confluent.io/v1beta1
kind: Connect
metadata:
  name: connect
  namespace: confluent
spec:
  replicas: 2
  image: { application: confluentinc/cp-server-connect:7.7.0, init: confluentinc/confluent-init-container:2.9.0 }
  dependencies:
    kafka:
      bootstrapEndpoint: kafka.confluent.svc.cluster.local:9071
      authentication: { type: mtls }
      tls: { enabled: true }
  podTemplate:
    podSecurityContext: { fsGroup: 1000, runAsUser: 1000 }
---
apiVersion: platform.confluent.io/v1beta1
kind: Connector
metadata:
  name: orders-to-warehouse
  namespace: confluent
spec:
  class: "io.confluent.connect.jdbc.JdbcSinkConnector"
  taskMax: 3
  connectClusterRef: { name: connect }
  configs:
    topics: "orders.created,orders.shipped"
    connection.url: "${vault:secret/data/connect/warehouse#jdbc_url}"   # resolved by Vault
    insert.mode: "upsert"
    pk.mode: "record_key"
kubectl apply -f connect.yaml
kubectl -n confluent rollout status statefulset/connect --timeout=300s
kubectl -n confluent get connector orders-to-warehouse -o jsonpath='{.status.connectorState}'  # RUNNING

8. Hand the cluster to observability and runtime security

Point Dynatrace (OneAgent on the node pool, plus the JMX/Prometheus extension) or Datadog (the Datadog Agent with the Kafka and Kafka Connect integrations) at the brokers so you get per-broker request latency, under-replicated partition counts, consumer-group lag, and disk-usage trends — the four metrics that predict a Kafka incident before it happens. Deploy CrowdStrike Falcon as a node-level DaemonSet sensor so runtime threats on the Kafka nodes feed the SOC, and let Wiz run continuous posture scanning so an accidental plaintext listener or a public LoadBalancer drift raises an alert immediately.

# Expose broker JMX as Prometheus for the agent to scrape (CFK ships the exporter)
kubectl -n confluent get svc kafka-0-internal -o yaml | grep -A2 metrics
# Confirm Datadog/Dynatrace is reading consumer lag:
kubectl -n confluent exec kafka-0 -- kafka-consumer-groups \
  --bootstrap-server localhost:9071 --describe --group analytics-warehouse \
  --command-config /mnt/sslcerts/client.properties

Validation

Walk the data path end to end before declaring success. Use the in-cluster Kafka tooling with a client config that presents an mTLS cert.

# 1. Cluster + components are RUNNING
kubectl -n confluent get kafka,kraftcontroller,schemaregistry,connect

# 2. Create a managed topic (declarative, not auto-created) and confirm RF=3
kubectl apply -f - <<'EOF'
apiVersion: platform.confluent.io/v1beta1
kind: KafkaTopic
metadata: { name: orders.created, namespace: confluent }
spec:
  replicas: 3
  partitions: 12
  configs: { "min.insync.replicas": "2" }
EOF
kubectl -n confluent exec kafka-0 -- kafka-topics \
  --bootstrap-server localhost:9071 --describe --topic orders.created \
  --command-config /mnt/sslcerts/client.properties

# 3. Produce and consume a record over TLS
kubectl -n confluent exec -it kafka-0 -- bash -c \
  'echo "{\"orderId\":\"A-1\"}" | kafka-console-producer \
    --bootstrap-server localhost:9071 --topic orders.created \
    --producer.config /mnt/sslcerts/client.properties'

kubectl -n confluent exec kafka-0 -- kafka-console-consumer \
  --bootstrap-server localhost:9071 --topic orders.created --from-beginning --max-messages 1 \
  --consumer.config /mnt/sslcerts/client.properties

# 4. Register a schema and confirm BACKWARD compatibility is enforced
curl -s --cacert cacerts.pem https://schemaregistry.confluent.svc.cluster.local:8081/config

# 5. Verify RBAC denies an unauthorized principal (should return AuthorizationException)
kafka-topics --bootstrap-server kafka.internal.acme.com:9092 --list \
  --command-config /tmp/unauthorized-client.properties

A green run is: all four CRs RUNNING, orders.created showing ReplicationFactor: 3 and Isr: 3, a record round-tripping, the registry returning BACKWARD, and the unauthorized client being denied.

Rollback and teardown

Because everything is declarative, rollback is a Git revert that Argo CD reconciles. To tear a cluster down by hand, delete the workload CRs first (so the operator drains and removes StatefulSets cleanly), then the operator, then — deliberately and last — the PersistentVolumeClaims, because the Retain reclaim policy keeps broker data even after the PVCs are gone.

# 1. Remove workloads (operator drains brokers gracefully)
kubectl -n confluent delete connector --all
kubectl -n confluent delete connect connect
kubectl -n confluent delete schemaregistry schemaregistry
kubectl -n confluent delete kafka kafka
kubectl -n confluent delete kraftcontroller kraftcontroller

# 2. Remove the operator
helm -n confluent uninstall confluent-operator

# 3. DELIBERATELY remove data last (irreversible)
kubectl -n confluent delete pvc -l app=kafka
kubectl -n confluent delete pvc -l app=kraftcontroller

For a rollback rather than a teardown, git revert the offending commit; Argo CD’s selfHeal re-applies the previous known-good CR set. Never edit a live CR with kubectl edit in production — the next Argo sync will overwrite it and you will have lost the change.

Common pitfalls

Security notes

The cluster is TLS-everywhere by construction: mTLS on the internal and controller listeners, TLS on the external listener, all signed by Vault’s PKI engine so leaf certs are short-lived and revocable. RBAC binds Entra ID groups (federated from Okta) to least-privilege roles, so access follows directory membership and a leaver is de-authorized automatically. Connector secrets and the license live in Vault, injected at runtime, never in a plain Kubernetes Secret or a Git-committed config. Wiz continuously checks posture so a plaintext-listener or public-LoadBalancer drift alarms immediately, and Wiz Code blocks the same misconfigurations in IaC before merge. CrowdStrike Falcon node sensors feed runtime detections to the SOC, and ServiceNow is the documented change gate for any new external listener or third-party role binding. Where a Schema Registry or Control Center surface is exposed to partners, Akamai terminates TLS and provides WAF at the edge.

Cost notes

The dominant cost is persistent disk and the always-on broker compute, not the Kafka software. Size dataVolumeCapacity to real retention — set per-topic retention.ms aggressively (orders events rarely need 7 days) so you are not paying to store data nobody reads, and use gp3 with provisioned IOPS rather than the pricier io2 unless a topic genuinely needs it. Three brokers and three KRaft controllers is the floor for production durability; do not over-provision replicas chasing headroom you can add later with allowVolumeExpansion and broker scale-out. Right-size the podTemplate resource requests to observed usage in Dynatrace/Datadog rather than guessing high, and let the Connect cluster scale taskMax to load instead of running idle workers. Topic compaction (cleanup.policy=compact) on changelog-style topics keeps only the latest value per key and can cut storage for those topics by an order of magnitude.

KafkaConfluentKubernetesCFKStreamingRBAC
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading