Self-Managed Kafka on Kubernetes with Strimzi for a Trading Platform

A mid-tier proprietary trading firm — call it the kind of shop that runs a few hundred million dollars of equities and futures strategies out of a single co-located rack near the exchange — gives its platform team an ultimatum. The market-data fan-out bus that feeds every strategy engine, the order-state journal, and the post-trade risk feed all run on a managed streaming service, and during the open and the close the firm is watching its tail latency blow out from a comfortable two milliseconds to forty, with occasional throttle-induced stalls that show up directly as missed fills on the desk. A missed fill is real money, and the head of trading does not care that the cloud provider’s SLA was technically met. The mandate: own the streaming layer, get the p99 down and keep it there, prove it with numbers, and do it without hiring a five-person Kafka operations team. This article is the reference architecture for that — a self-managed, self-healing Apache Kafka platform on Kubernetes, operated by Strimzi, tuned for a low-latency trading feed, and instrumented so the firm’s CTO can see the SLO is holding in real time.

The pressures in trading stack differently from a typical enterprise. Latency is the product, not a feature — the entire reason to leave a managed service is that you cannot tune a broker you do not control, cannot pin it to the right instance type, and cannot eliminate the noisy-neighbor throttle that costs you fills. Durability still matters because the order journal is a regulated record under MiFID II / SEC 17a-4 retention, so “fast but lossy” is not on the table. Availability means surviving the loss of an Availability Zone mid-session without a trading halt. And operational leverage means a two-person platform team has to run this, which is the single fact that makes Strimzi — not a hand-rolled Kafka install — the only sane answer.

Why self-manage, and why an operator

The honest first question is whether to self-manage Kafka at all. Managed Kafka (MSK, Confluent Cloud) is the right default for most teams, and naming why this firm leaves it matters because someone will rightly challenge the decision.

The firm leaves managed Kafka for exactly three reasons that do not apply to a normal SaaS backend. First, broker placement and instance type — a low-latency feed wants brokers pinned to specific high-clock, network-optimized instances in the same AZ as the strategy engines, with local NVMe for the commit log, and a managed service abstracts that away. Second, kernel and JVM tuning — the page-cache behavior, the vm.dirty_ratio, the G1 pause targets, the NIC interrupt affinity all matter at single-digit-millisecond p99, and you cannot touch them on a managed broker. Third, throttle control — managed services protect themselves with quotas that surface as latency spikes precisely at the open and close, which is the worst possible time for this workload.

But self-managing Kafka the old way — Ansible playbooks, hand-rolled failover runbooks, 3 a.m. pages to manually reassign partitions when a broker dies — is exactly the five-person-team cost the firm is trying to avoid. Strimzi resolves the contradiction. It is a Kubernetes operator that turns Kafka into declarative custom resources: you describe the cluster you want (Kafka, KafkaNodePool, KafkaTopic, KafkaUser) in YAML, and the operator continuously reconciles reality to match — provisioning brokers, rolling upgrades one broker at a time with controlled-shutdown, regenerating TLS certs before they expire, and recovering a failed broker onto a healthy node without a human in the loop. The firm gets the control of self-managed Kafka with something close to the operational cost of a managed one. The whole platform runs on Amazon EKS so the operator, the brokers, and the firm’s own tooling share one control plane and one IAM story.

Architecture overview

Figure: Self-Managed Kafka on Kubernetes with Strimzi for a Trading Platform (architecture diagram)

The platform has two planes that share infrastructure but live on different timescales: a hot data plane — producers slamming market data and order events through brokers to consumers, where every microsecond is measured — and a control plane — the Strimzi operator, certificate rotation, MirrorMaker, and observability, which runs continuously but off the critical latency path. Keeping these separate in your head is the first step to operating this well.

The defining property of the topology is rack awareness mapped to AWS Availability Zones. The Kafka cluster runs across three AZs in one region. Strimzi sets each broker’s broker.rack from the topology.kubernetes.io/zone label of the node it runs on, and Kafka’s rack-aware replica assignment then spreads the replicas of each partition across racks, so with a replication factor of 3 over three AZs every partition has one replica per zone. Lose an entire AZ and every partition still has a surviving in-sync replica in another zone: no loss of acknowledged acks=all writes, no trading halt, just a leader election that is itself fast once the controller quorum has detected the lost brokers. Detection is bounded by the broker session timeout, so tune and measure it rather than assume it.

Hot path, following the data flow:

  1. Producers (the market-data gateway that normalizes the exchange feed, and the order-management system that journals every order state transition) run as pods in the same AZ as the partition leaders they write to, so the producer-to-leader hop stays inside one zone and skips the inter-AZ network hop. Replication to the followers still crosses zones, which is the point of the three-AZ layout. Producers connect over the internal listener with mTLS.
  2. The write lands on the partition leader broker. For the order journal, the producer runs with acks=all and the topic has min.insync.replicas=2 across a replication factor of 3 — the write is only acknowledged once a second AZ has it, which is what makes the journal durable enough for a regulated record. For the ephemeral market-data fan-out, where a dropped tick is re-sent on the next quote and durability is worthless, producers run acks=1 to shave the acknowledgement latency.
  3. Kafka uses KRaft mode — no ZooKeeper — so cluster metadata and leader election live in a dedicated quorum of controller nodes, removing an entire failure-prone dependency and cutting failover time. Strimzi provisions the controllers as their own KafkaNodePool, isolated from the brokers that carry the hot data.
  4. Consumers (the strategy engines, the real-time risk aggregator, the order-state cache) read from the leader, or from a rack-local follower via follower fetching: with client.rack set on the consumer to its own zone, it reads from a replica in that AZ and avoids the cross-zone read hop. They commit offsets back to Kafka.
  5. The commit log for the latency-sensitive topics sits on local NVMe instance storage on the broker nodes for the lowest possible write latency, with tiered storage offloading older closed log segments to Amazon S3 so the brokers keep only the recent hot data on fast local disk and the long retention required for compliance lives cheaply in object storage.
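As a rough sketch of how the two durability profiles in the steps above might be declared, the KafkaTopic resources below contrast a journal topic with a fan-out topic. The topic names, partition counts, and the fan-out retention are illustrative assumptions, not values from the platform. The acks setting lives in the producer client, so it appears here only as a comment.

```yaml
# Durable, regulated topic: three replicas, acknowledged only once two are in sync.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-journal              # hypothetical topic name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  partitions: 12                   # assumed; size for consumer parallelism
  replicas: 3
  config:
    min.insync.replicas: 2         # pairs with acks=all on the producer
---
# Ephemeral fan-out topic: still replicated, but producers use acks=1.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: market-data-quotes         # hypothetical topic name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  partitions: 48                   # assumed
  replicas: 3
  config:
    retention.ms: 3600000          # assumed: one hour of ticks
```

These resources are only reconciled if the Topic Operator is deployed alongside the cluster.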

Control plane, continuous and off the hot path: the Strimzi Cluster Operator watches the custom resources and reconciles the cluster. MirrorMaker 2 (run by Strimzi) replicates the order journal and other durable topics to a second AWS region for disaster recovery. Prometheus scrapes JMX metrics from every broker via the Strimzi metrics exporter, and Grafana renders the latency SLO dashboards the desk and the CTO watch.
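A minimal sketch of how the metrics side of this control plane might be switched on in the Kafka resource. The ConfigMap name and key are assumptions; the ConfigMap would hold the JMX exporter rules, which are not shown.

```yaml
# Fragment of the Kafka resource: expose broker JMX metrics and consumer-group lag.
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics              # hypothetical ConfigMap with exporter rules
          key: kafka-metrics-config.yml    # hypothetical key
  kafkaExporter:                           # adds consumer-lag metrics per group and topic
    topicRegex: ".*"
    groupRegex: ".*"
```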

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Cluster orchestration | Amazon EKS | Runs the operator, brokers, controllers, MirrorMaker | Dedicated node groups; AZ-spread; cluster autoscaler |
| Kafka lifecycle | Strimzi operator | Declarative provisioning, rolling upgrades, cert rotation, self-heal | Kafka + KafkaNodePool CRs; KRaft; rack from node zone label |
| Brokers | Kafka (KRaft) | Hot data plane: log append, replication, leader election | RF=3, min.insync.replicas=2; rack-aware placement |
| Controllers | Kafka KRaft quorum | Cluster metadata + leader election (no ZooKeeper) | Separate node pool; 3 controllers across AZs |
| Hot log storage | Local NVMe (instance store) | Lowest-latency commit log for the recent window | i-family nodes; per-broker JBOD; XFS |
| Cold log storage | Tiered storage → Amazon S3 | Long retention for compliance, cheaply | remote.storage.enable; local retention hours, remote retention years |
| Cross-region DR | MirrorMaker 2 | Async replicate durable topics to a paired region | Strimzi KafkaMirrorMaker2 CR; offset + ACL sync |
| Transport security | mTLS (Strimzi-managed CA) | Encrypt + mutually authenticate every client and broker | Cluster + clients CA; auto-renew; TLS internal listener |
| Authorization | Kafka ACLs via KafkaUser | Per-service topic permissions, least privilege | KafkaUser CRs; StandardAuthorizer; deny by default |
| Human identity | Okta + Entra ID | SSO to Grafana, EKS console, the platform tooling | OIDC to EKS; SAML to Grafana; conditional access |
| Secrets | HashiCorp Vault | App credentials, MirrorMaker peer creds, signing keys | IRSA-backed auth; dynamic leases; Agent sidecar injection |
| Cloud posture | Wiz / Wiz Code | CSPM on EKS + S3, IaC scanning of the Terraform/Helm | Agentless scan; alerts on public S3 or open security group; Wiz Code in CI |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on broker and operator nodes | Sensor as DaemonSet; detections to the SOC |
| Observability | Prometheus + Grafana + Dynatrace | JMX metrics, latency SLO dashboards, full-stack tracing | Strimzi JMX exporter; Davis anomaly detection; SLO alerts |
| ITSM | ServiceNow | Change approvals for cluster changes, incident records | Change gate before a broker config change; auto-ticket on SLO breach |
| CI / IaC | GitHub Actions + Argo CD + Terraform | Build/test, GitOps deploy of CRs, infra as code | OIDC to AWS; Argo CD syncs Strimzi CRs; Terraform for EKS/VPC/S3 |

A few of these choices deserve the why, because they are the ones teams get wrong.

Why tiered storage instead of just big local disks. A trading firm’s order journal must be retained for years, but the brokers only ever serve the recent window at low latency. Putting years of cold segments on local NVMe is ruinously expensive and forces enormous brokers whose recovery (re-replicating a full disk after a node loss) takes hours. Tiered storage keeps only a few hours of hot data on local NVMe and transparently offloads closed segments to S3; a broker’s local footprint stays small, so recovery is fast, while retention is effectively unlimited and cheap. The trade is that a consumer reading far back in history pays an S3 fetch — irrelevant for a real-time desk, acceptable for the occasional compliance replay.

Why KRaft, not ZooKeeper. ZooKeeper was a second distributed system to operate, secure, and recover, and its loss could freeze the cluster. KRaft folds metadata management into a Kafka controller quorum, removing that dependency entirely, shrinking failover time during a broker loss, and simplifying the security surface — one fewer thing for a two-person team to run and for Wiz to have to assess.

Why mTLS everywhere, not just at the edge. In a flat Kubernetes network, any compromised pod can reach a broker port. mTLS means every producer and consumer presents a client certificate the broker verifies, so an unauthorized pod cannot even open a session, and all traffic is encrypted in transit — table stakes for order data. Strimzi runs its own certificate authority, issues per-client certs through the KafkaUser resource, and rotates them automatically before expiry, which removes the classic self-managed-Kafka footgun of a cluster-wide outage when a hand-managed cert silently expires.

Implementation guidance

Provision the substrate with Terraform, then hand the cluster to GitOps. The split matters: Terraform owns the slow-moving cloud substrate (VPC across three AZs, the EKS cluster, node groups on the right instance families, the S3 tiered-storage bucket, IAM/IRSA roles), and Argo CD owns the fast-moving Kafka definition (the Strimzi CRs), so a topic or broker-config change is a reviewed Git commit that Argo syncs — never a kubectl apply from someone’s laptop.
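The GitOps half of that split could look roughly like the Argo CD Application below. This is a sketch under assumptions: the repository URL, path, and namespaces are hypothetical, and the sync policy is one reasonable choice rather than a prescribed one.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trading-bus-kafka
  namespace: argocd                  # assumed Argo CD namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/kafka-platform.git   # hypothetical repo
    targetRevision: main
    path: clusters/prod/kafka        # hypothetical directory holding the Strimzi CRs
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka                 # assumed namespace for the cluster
  syncPolicy:
    automated:
      selfHeal: true                 # revert out-of-band changes back to what Git says
      prune: false                   # never auto-delete a topic or user removed by mistake
```

Leaving prune off is a deliberate design choice here: a KafkaTopic accidentally dropped from Git then stays in the cluster until someone removes it on purpose.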

A trimmed Kafka custom resource communicates the intent — KRaft, rack-aware, tuned listeners:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: trading-bus
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
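spec:
  kafka:
    replicas: 6                                   # ignored once node pools are enabled; the KafkaNodePool sets the count
    rack:
      topologyKey: topology.kubernetes.io/zone   # spread replicas across AZs
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls                               # mTLS for every client
    authorization:
      type: simple                                # enforce the ACLs declared in KafkaUser resources
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      replica.selector.class: org.apache.kafka.common.replica.RackAwareReplicaSelector  # follower fetch
  entityOperator:                                 # Topic and User Operators reconcile KafkaTopic / KafkaUser
    topicOperator: {}
    userOperator: {}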

And the broker node pool, pinned to local-NVMe instances with JBOD storage:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  labels:
    strimzi.io/cluster: trading-bus
spec:
  replicas: 6
  roles: [broker]
  storage:
    type: jbod
    volumes:
      - id: 0
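        type: ephemeral          # emptyDir-backed: on local NVMe only if the node's kubelet storage sits on the instance store
                                 # data is lost whenever the pod is deleted, so the broker re-replicates when it returns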
        sizeLimit: 800Gi
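The dedicated controller quorum described earlier would be a second node pool. A minimal sketch, assuming a small persistent volume for the metadata log; the size and StorageClass name are placeholders.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  labels:
    strimzi.io/cluster: trading-bus
spec:
  replicas: 3                      # one controller per AZ
  roles: [controller]              # metadata quorum only, no client traffic
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim     # assumed: durable volume for the metadata log
        size: 20Gi                 # assumed size
        class: gp3                 # hypothetical StorageClass name
        deleteClaim: false
```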

The CI that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak, and Wiz Code scans the Terraform and Helm/CR manifests in the pull request — flagging a publicly readable S3 bucket or an over-broad security group before it ever reaches a cluster. Argo CD then reconciles the merged CRs onto EKS.
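A trimmed sketch of what that pull-request workflow might look like. The role ARN, region, and repository layout are hypothetical, and the IaC scan step is left as a comment because its invocation depends on the scanner's own CLI.

```yaml
name: kafka-platform-ci
on:
  pull_request:
    paths: ["terraform/**", "clusters/**"]   # hypothetical repo layout
permissions:
  id-token: write      # lets the job request an OIDC token from GitHub
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/kafka-platform-ci   # hypothetical role
          aws-region: us-east-1                                              # assumed region
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=terraform init -input=false
      - run: terraform -chdir=terraform plan -input=false
      # The IaC scan of the Terraform and CR manifests would run here.
```

The role's trust policy has to name GitHub's OIDC provider and restrict the subject to this repository, otherwise any workflow could assume it.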

Tune for the trading workload, because defaults are not for you. Pin brokers to network-optimized, local-NVMe instances (an i-family or comparable) so the commit log writes to local disk and the NIC has headroom for replication traffic. Set the JVM heap modestly (Kafka leans on the OS page cache, not a huge heap) with G1 pause targets tuned low. Increase num.network.threads and num.io.threads for the fan-out, and align producers and the partition leaders into the same AZ so the hot write path never crosses a zone. Use follower fetching (the RackAwareReplicaSelector above) so consumers read from an in-zone replica and avoid the cross-AZ read hop and its dollar cost. These are the knobs a managed service hides and the reason the firm left.
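Several of those knobs can be expressed directly in the Strimzi resources. The fragments below are a sketch only: every number and the instance type are placeholders to be replaced by load-test results, not recommendations.

```yaml
# Fragment of the Kafka resource: broker threads and JVM settings.
spec:
  kafka:
    config:
      num.network.threads: 8       # assumed value
      num.io.threads: 16           # assumed value
    jvmOptions:
      "-Xms": 6g                   # modest heap; the page cache does the heavy lifting
      "-Xmx": 6g
      "-XX":
        MaxGCPauseMillis: "20"     # assumed G1 pause target
---
# Fragment of the brokers KafkaNodePool: keep broker pods on the tuned instance type.
spec:
  template:
    pod:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["i4i.4xlarge"]   # assumed instance type
```

Follower fetching also needs the client side: each consumer sets client.rack to its own zone, or the broker-side selector has nothing to match against.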

Configure tiered storage explicitly per topic. Enable remote storage on the durable topics and set a short local retention with a long remote one, so hot data stays on NVMe and cold data lands in S3:

remote.storage.enable=true
local.retention.ms=10800000        # 3 hours hot on local NVMe
retention.ms=220903200000          # ~7 years total, the tail living in S3
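In a GitOps setup the same three settings would normally sit in a KafkaTopic resource, and they only take effect once remote storage is enabled on the brokers. The sketch below assumes a remote storage plugin baked into a custom broker image: the class name, class path, plugin config key, bucket, topic name, and partition count are all hypothetical.

```yaml
# Fragment of the Kafka resource: broker-side switch for remote storage.
spec:
  kafka:
    tieredStorage:
      type: custom
      remoteStorageManager:
        className: com.example.tiered.S3RemoteStorageManager   # hypothetical plugin class
        classPath: /opt/kafka/plugins/tiered-storage/*         # hypothetical path in the image
        config:
          storage.bucket.name: trading-bus-tiered              # hypothetical key and bucket
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-journal                # hypothetical topic name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  partitions: 12                     # assumed
  replicas: 3
  config:
    min.insync.replicas: 2
    remote.storage.enable: true
    local.retention.ms: 10800000     # 3 hours hot on local disk
    retention.ms: 220903200000       # about 7 years in total
```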

Enterprise considerations

Security & Zero Trust. The platform is Zero Trust by construction inside the cluster: mTLS authenticates and encrypts every client-broker session, and Kafka ACLs declared through KafkaUser resources enforce least privilege per service. The market-data gateway can Write only to the quote topics, a strategy engine can Read only the feeds it subscribes to, and everything else is denied by default. Layer on top: (a) HashiCorp Vault holds the few real secrets (the MirrorMaker peer-cluster credentials, third-party feed tokens, signing keys), leased dynamically and injected by the Vault Agent sidecar with IRSA-backed auth, so nothing sensitive sits in a plain Kubernetes Secret; (b) Wiz runs continuous CSPM across EKS and S3 and attack-path analysis, alerting the moment the tiered-storage bucket drifts toward public exposure or a node’s security group widens, with Wiz Code as the shift-left check in CI; (c) CrowdStrike Falcon sensors run as a DaemonSet on every broker and operator node for runtime threat detection, feeding the firm’s SOC; (d) human access to Grafana, the EKS console, and the platform tooling federates through Okta (brokered to Entra ID where Azure-side tooling needs a native token), so engineers authenticate once under conditional access rather than sharing a kubeconfig. An SLO breach or a Falcon detection auto-raises a ServiceNow incident so there is a ticket and an audit trail, not just a log line.
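The per-service least-privilege rules might be declared along these lines. The user names, the topic prefix, and the consumer group name are hypothetical; the sketch assumes simple authorization is enabled on the cluster and the User Operator is running.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: market-data-gateway          # hypothetical service identity
  labels:
    strimzi.io/cluster: trading-bus
spec:
  authentication:
    type: tls                        # the User Operator issues the client certificate
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: quotes.              # hypothetical topic prefix
          patternType: prefix
        operations: [Write, Describe]
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: strategy-engine-a            # hypothetical service identity
  labels:
    strimzi.io/cluster: trading-bus
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: quotes.
          patternType: prefix
        operations: [Read, Describe]
      - resource:
          type: group
          name: strategy-engine-a    # consumers also need Read on their group
          patternType: literal
        operations: [Read]
```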

Cost optimization. Self-managed is not automatically cheaper; engineer it to be.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Tiered storage | Cold segments to S3 instead of giant local disks | Slashes per-broker storage and shrinks recovery time |
| Right-sized brokers | Pin to the instance family that fits IO + network, no more | Avoids paying for idle CPU on oversized nodes |
| In-AZ traffic alignment | Producers/consumers read-write in-zone via follower fetch | Cuts inter-AZ data-transfer charges, often a top-3 line item |
| acks per topic | acks=1 for ephemeral feeds, acks=all only where durable | Shorter producer ack wait on the high-volume fan-out (replication traffic itself is unchanged) |
| Spot for non-critical | MirrorMaker / batch consumers on Spot, brokers on On-Demand | Cheaper control-plane and consumer compute |

The inter-AZ alignment lever is the one teams forget: a chatty cross-zone consumer can quietly make data-transfer the largest bill on the platform, and follower fetching plus AZ-pinned producers is what keeps it down.

Scalability. Each tier scales independently. Brokers scale out by raising replicas on the KafkaNodePool; Strimzi provisions the new broker and you rebalance partitions onto it with Cruise Control (which Strimzi integrates) rather than a hand-built reassignment plan. Topic throughput scales with partition count — size it for the consumer parallelism the strategy engines need, since a partition is the unit of consumer concurrency. EKS node groups scale via the cluster autoscaler. The natural ceiling on a single cluster is metadata and replication overhead, which is why a firm at real scale shards by domain (a market-data cluster separate from the order-journal cluster) rather than one mega-cluster.
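After a scale-out, the rebalance onto the new brokers could be requested with a KafkaRebalance resource like the one below. A sketch only: the broker IDs are assumed, and it presumes Cruise Control is enabled in the Kafka resource.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: add-brokers-6-7
  labels:
    strimzi.io/cluster: trading-bus
spec:
  mode: add-brokers        # move partitions onto the listed brokers only
  brokers: [6, 7]          # assumed IDs of the newly added brokers
```

The operator first produces a proposal in the resource status; nothing moves until the resource is annotated with strimzi.io/rebalance=approve, which makes a natural hook for the change-approval gate.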

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per tier. Within the region, the rack-aware three-AZ layout loses no acknowledged acks=all write in an AZ failure, and recovery takes as long as failure detection plus leader election, which is seconds rather than minutes. For a full regional outage, MirrorMaker 2 asynchronously replicates the durable topics (the order journal above all) to a paired AWS region, syncing both the data and the consumer-group offsets so a failed-over consumer resumes near where it stopped. Async replication means the cross-region RPO is the replication lag (typically seconds), and the RTO is how fast you repoint producers and consumers at the DR cluster, which a runbook plus DNS brings down to minutes. A pragmatic target for this platform: in-region RTO under one minute and RPO zero for the acks=all topics; cross-region RTO ~15 minutes and RPO seconds. The order journal’s S3-backed tiered storage is a further backstop: closed segments survive in object storage even if the cluster is lost, although rebuilding a cluster from them is not something Kafka does out of the box and needs its own tested procedure.
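The cross-region leg might be declared roughly as follows. This is a sketch with assumptions throughout: the bootstrap addresses and topic pattern are hypothetical, TLS and authentication blocks are omitted for brevity, and the field layout should be checked against the Strimzi version you pin.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: trading-bus-dr
spec:
  replicas: 2
  connectCluster: dr                       # MirrorMaker 2 runs beside the target cluster
  clusters:
    - alias: primary
      bootstrapServers: trading-bus-kafka-bootstrap.primary.example.internal:9093   # hypothetical
    - alias: dr
      bootstrapServers: trading-bus-dr-kafka-bootstrap:9093                          # hypothetical
  mirrors:
    - sourceCluster: primary
      targetCluster: dr
      topicsPattern: "order-journal.*"     # hypothetical topic naming
      groupsPattern: ".*"
      sourceConnector:
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
          sync.group.offsets.enabled: "true"   # translate consumer offsets onto the DR cluster
```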

Observability and the SLO contract. This is where self-managing earns its keep, because the whole project is justified by proving latency. The Strimzi JMX exporter exposes every broker metric to Prometheus, and Grafana renders the dashboards the desk and CTO watch. The SLOs are explicit and alerted:

| SLO | Metric | Target |
|---|---|---|
| Producer ack latency (order journal) | producer request-latency p99 | < 5 ms |
| End-to-end feed latency | produce-to-consume p99 | < 8 ms |
| Consumer lag (strategy engines) | records-lag-max | < 1000 records |
| Replication health | UnderReplicatedPartitions | 0 |
| Availability | broker / controller uptime | 99.99% |
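Two of these SLOs translate fairly directly into alert rules. The sketch below assumes the Prometheus Operator is installed and that the exporters publish metrics under these names, which depends on the JMX exporter rules and on Kafka Exporter being enabled; the durations and severities are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trading-bus-slo
spec:
  groups:
    - name: kafka-slo
      rules:
        - alert: UnderReplicatedPartitions
          # Metric name assumed from typical JMX exporter rules.
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Partitions are under-replicated"
        - alert: StrategyConsumerLag
          # Metric name assumed from Kafka Exporter.
          expr: max by (consumergroup, topic) (kafka_consumergroup_lag) > 1000
          for: 30s
          labels:
            severity: warning
          annotations:
            summary: "Consumer lag above the 1000-record SLO"
```

The latency SLOs are harder to alert on from broker metrics alone, since produce-to-consume time has to be measured by the clients or by a synthetic probe.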

Dynatrace sits above the Kafka-native metrics with full-stack distributed tracing and Davis anomaly detection, correlating a producer-side latency spike to a node, a network event, or a GC pause so the two-person team gets a root cause, not just a red graph — and surfacing a regression on its own before it trips the SLO. A sustained SLO breach auto-raises a ServiceNow incident.

Governance. Pin the Kafka and Strimzi versions explicitly and promote upgrades through a staging cluster — Strimzi rolls a version change one broker at a time with controlled shutdown, but you still gate it. Keep every KafkaTopic, KafkaUser, and broker config in Git, reviewed and instantly revertable, with Argo CD as the single applier so the live cluster never drifts from the repo. Route cluster changes through a ServiceNow change approval for an audit trail, and rely on Wiz as the independent check that the posture (no public S3, mTLS on, ACLs enforced) is actually holding.

Explicit tradeoffs

Accept these or do not build it. Self-managing Kafka — even with Strimzi doing the heavy lifting — means you now own broker tuning, capacity planning, partition rebalancing, and the on-call for a stateful distributed system, which a managed service carried for you. Strimzi shrinks that load dramatically but does not erase it: someone still has to understand KRaft quorums, ISR dynamics, and why a partition went under-replicated. Tiered storage adds an S3 dependency and a small fetch-latency cliff for historical reads. The rack-aware, in-AZ-aligned topology that delivers the latency and the data-transfer savings is more design effort than a single-zone cluster, and getting follower fetching and producer placement right is fiddly. And running this on EKS means the platform team owns both Kubernetes and Kafka — two deep systems — which is only worth it because the operator makes the Kafka half tractable for a small team.

The alternatives, and when they win. If latency is not your product and you just need durable streaming, managed Kafka (MSK or Confluent Cloud) is the right default — less to operate, and you give up exactly the tuning this firm needed. If your workload is simple queue-and-fan-out rather than a high-throughput replayable log, a managed message queue (SQS/SNS, or a lighter broker) is simpler than any Kafka at all. If you need stream processing on top — windowed aggregations, joins — add Kafka Streams or Flink as consumers rather than pushing logic into the brokers. And if you are a small team without Kubernetes expertise, running Kafka on dedicated VMs with Ansible trades the EKS learning curve for a more manual operational model — viable, but it gives back much of the self-healing that made Strimzi worth choosing.

The shape of the win

For the trading desk, the payoff is not “we run our own Kafka.” It is that during the open and the close — the moments that decide the firm’s P&L — the produce-to-consume p99 sits under eight milliseconds on a Grafana panel the head of trading can see, an AZ can fail without a single missed fill, the order journal is provably durable and retained for the regulator, and a two-person platform team runs the whole thing because Strimzi handles the broker lifecycle, certificate rotation, and self-healing that used to demand a dedicated ops squad. That combination — managed-service-level operational cost with self-managed-level control over latency — is the one that justifies the build. Everything upstream — the rack-aware AZ spread, the KRaft controllers, the tiered storage to S3, the mTLS and ACLs, the Vault-held secrets, the Wiz posture scanning, the Dynatrace anomaly detection, the MirrorMaker DR — exists so the desk gets its fills and the CTO can prove the SLO is holding. The architecture here is the destination; start with a single durable cluster if you must, but a regulated, latency-critical trading feed is where self-managed Kafka on Kubernetes has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
