A mid-tier proprietary trading firm — call it the kind of shop that runs a few hundred million dollars of equities and futures strategies out of a single co-located rack near the exchange — gives its platform team an ultimatum. The market-data fan-out bus that feeds every strategy engine, the order-state journal, and the post-trade risk feed all run on a managed streaming service, and during the open and the close the firm is watching its tail latency blow out from a comfortable two milliseconds to forty, with occasional throttle-induced stalls that show up directly as missed fills on the desk. A missed fill is real money, and the head of trading does not care that the cloud provider’s SLA was technically met. The mandate: own the streaming layer, get the p99 down and keep it there, prove it with numbers, and do it without hiring a five-person Kafka operations team. This article is the reference architecture for that — a self-managed, self-healing Apache Kafka platform on Kubernetes, operated by Strimzi, tuned for a low-latency trading feed, and instrumented so the firm’s CTO can see the SLO is holding in real time.
The pressures in trading stack differently from a typical enterprise. Latency is the product, not a feature — the entire reason to leave a managed service is that you cannot tune a broker you do not control, cannot pin it to the right instance type, and cannot eliminate the noisy-neighbor throttle that costs you fills. Durability still matters because the order journal is a regulated record under MiFID II / SEC 17a-4 retention, so “fast but lossy” is not on the table. Availability means surviving the loss of an Availability Zone mid-session without a trading halt. And operational leverage means a two-person platform team has to run this, which is the single fact that makes Strimzi — not a hand-rolled Kafka install — the only sane answer.
Why self-manage, and why an operator
The honest first question is whether to self-manage Kafka at all. Managed Kafka (MSK, Confluent Cloud) is the right default for most teams, and naming why this firm leaves it matters because someone will rightly challenge the decision.
The firm leaves managed Kafka for exactly three reasons that do not apply to a normal SaaS backend. First, broker placement and instance type — a low-latency feed wants brokers pinned to specific high-clock, network-optimized instances in the same AZ as the strategy engines, with local NVMe for the commit log, and a managed service abstracts that away. Second, kernel and JVM tuning — the page-cache behavior, the vm.dirty_ratio, the G1 pause targets, the NIC interrupt affinity all matter at single-digit-millisecond p99, and you cannot touch them on a managed broker. Third, throttle control — managed services protect themselves with quotas that surface as latency spikes precisely at the open and close, which is the worst possible time for this workload.
But self-managing Kafka the old way — Ansible playbooks, hand-rolled failover runbooks, 3 a.m. pages to manually reassign partitions when a broker dies — is exactly the five-person-team cost the firm is trying to avoid. Strimzi resolves the contradiction. It is a Kubernetes operator that turns Kafka into declarative custom resources: you describe the cluster you want (Kafka, KafkaNodePool, KafkaTopic, KafkaUser) in YAML, and the operator continuously reconciles reality to match — provisioning brokers, rolling upgrades one broker at a time with controlled-shutdown, regenerating TLS certs before they expire, and recovering a failed broker onto a healthy node without a human in the loop. The firm gets the control of self-managed Kafka with something close to the operational cost of a managed one. The whole platform runs on Amazon EKS so the operator, the brokers, and the firm’s own tooling share one control plane and one IAM story.
Architecture overview
The platform has two planes that share infrastructure but live on different timescales: a hot data plane — producers slamming market data and order events through brokers to consumers, where every microsecond is measured — and a control plane — the Strimzi operator, certificate rotation, MirrorMaker, and observability, which runs continuously but off the critical latency path. Keeping these separate in your head is the first step to operating this well.
The defining property of the topology is rack awareness mapped to AWS Availability Zones. The Kafka cluster runs across three AZs in one region. Strimzi sets each broker's broker.rack config from the topology.kubernetes.io/zone label of the node it runs on, and Kafka's rack-aware replica placement then spreads the replicas of every partition across racks, so with a replication factor of 3 and three zones each replica lands in a different AZ. Lose an entire AZ and every partition still has a surviving in-sync replica in another zone: no data loss, no trading halt, just a leader election that completes in well under a second once the controller quorum registers the lost brokers (failure detection itself is bounded by the broker session timeout, so tune it deliberately).
Hot path, following the data flow:
- Producers — the market-data gateway that normalizes the exchange feed, and the order-management system that journals every order state transition — run as pods in the same AZ as the partition leaders they write to, so the hot write path never crosses a zone boundary and never pays the inter-AZ network hop. They connect over the internal listener with mTLS.
- The write lands on the partition leader broker. For the order journal, the producer runs with acks=all and the topic has min.insync.replicas=2 across a replication factor of 3, so the write is only acknowledged once a second AZ has it, which is what makes the journal durable enough for a regulated record. For the ephemeral market-data fan-out, where a dropped tick is re-sent on the next quote and durability is worthless, producers run acks=1 to shave the acknowledgement latency.
- Kafka uses KRaft mode (no ZooKeeper), so cluster metadata and leader election live in a dedicated quorum of controller nodes, removing an entire failure-prone dependency and cutting failover time. Strimzi provisions the controllers as their own KafkaNodePool, isolated from the brokers that carry the hot data.
- Consumers (the strategy engines, the real-time risk aggregator, the order-state cache) read from the leader, or from a rack-local follower via follower fetching, so a consumer reads from a replica in its own AZ and avoids the cross-zone read hop. They commit offsets back to Kafka.
- The commit log for the latency-sensitive topics sits on local NVMe instance storage on the broker nodes for the lowest possible write latency, with tiered storage offloading older closed log segments to Amazon S3 so the brokers keep only the recent hot data on fast local disk and the long retention required for compliance lives cheaply in object storage.
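The durability split in the list above can be sketched as two KafkaTopic resources. This is an illustrative sketch rather than the firm's actual definitions: the topic names order-journal and md-quotes, the partition counts, and the retention on the quote topic are assumptions, while the cluster label trading-bus matches the Kafka resource shown later in this article. Note that acks is a producer-side setting, so it appears only in comments here.

```yaml
# Durable, regulated order journal: a write is acknowledged only once
# two replicas (in two different AZs) have it.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-journal                # hypothetical topic name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  partitions: 12                     # assumed; size for consumer parallelism
  replicas: 3                        # replication factor 3, one replica per AZ
  config:
    min.insync.replicas: 2           # takes effect for producers using acks=all
---
# Ephemeral market-data fan-out: producers use acks=1, so the leader
# acknowledges without waiting for followers.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: md-quotes                    # hypothetical topic name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  partitions: 48                     # assumed
  replicas: 3
  config:
    retention.ms: 3600000            # assumed: one hour is plenty for a live feed
```

Keeping both topics at three replicas preserves AZ-failure tolerance for readers; the latency difference comes from what the producer waits for, not from the replica count.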
Control plane, continuous and off the hot path: the Strimzi Cluster Operator watches the custom resources and reconciles the cluster. MirrorMaker 2 (run by Strimzi) replicates the order journal and other durable topics to a second AWS region for disaster recovery. Prometheus scrapes JMX metrics from every broker via the Strimzi metrics exporter, and Grafana renders the latency SLO dashboards the desk and the CTO watch.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Cluster orchestration | Amazon EKS | Runs the operator, brokers, controllers, MirrorMaker | Dedicated node groups; AZ-spread; cluster autoscaler |
| Kafka lifecycle | Strimzi operator | Declarative provisioning, rolling upgrades, cert rotation, self-heal | Kafka + KafkaNodePool CRs; KRaft; rack from node zone label |
| Brokers | Kafka (KRaft) | Hot data plane: log append, replication, leader election | RF=3, min.insync.replicas=2; rack-aware placement |
| Controllers | Kafka KRaft quorum | Cluster metadata + leader election (no ZooKeeper) | Separate node pool; 3 controllers across AZs |
| Hot log storage | Local NVMe (instance store) | Lowest-latency commit log for the recent window | i-family nodes; per-broker JBOD; XFS |
| Cold log storage | Tiered storage → Amazon S3 | Long retention for compliance, cheaply | remote.storage.enable; local retention hours, remote retention years |
| Cross-region DR | MirrorMaker 2 | Async replicate durable topics to a paired region | Strimzi KafkaMirrorMaker2 CR; offset + ACL sync |
| Transport security | mTLS (Strimzi-managed CA) | Encrypt + mutually authenticate every client and broker | Cluster + clients CA; auto-renew; TLS internal listener |
| Authorization | Kafka ACLs via KafkaUser | Per-service topic permissions, least privilege | KafkaUser CRs; StandardAuthorizer; deny by default |
| Human identity | Okta + Entra ID | SSO to Grafana, EKS console, the platform tooling | OIDC to EKS; SAML to Grafana; conditional access |
| Secrets | HashiCorp Vault | App credentials, MirrorMaker peer creds, signing keys | IRSA-backed auth; dynamic leases; Agent sidecar injection |
| Cloud posture | Wiz / Wiz Code | CSPM on EKS + S3, IaC scanning of the Terraform/Helm | Agentless scan; alerts on public S3 or open security group; Wiz Code in CI |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on broker and operator nodes | Sensor as DaemonSet; detections to the SOC |
| Observability | Prometheus + Grafana + Dynatrace | JMX metrics, latency SLO dashboards, full-stack tracing | Strimzi JMX exporter; Davis anomaly detection; SLO alerts |
| ITSM | ServiceNow | Change approvals for cluster changes, incident records | Change gate before a broker config change; auto-ticket on SLO breach |
| CI / IaC | GitHub Actions + Argo CD + Terraform | Build/test, GitOps deploy of CRs, infra as code | OIDC to AWS; Argo CD syncs Strimzi CRs; Terraform for EKS/VPC/S3 |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why tiered storage instead of just big local disks. A trading firm’s order journal must be retained for years, but the brokers only ever serve the recent window at low latency. Putting years of cold segments on local NVMe is ruinously expensive and forces enormous brokers whose recovery (re-replicating a full disk after a node loss) takes hours. Tiered storage keeps only a few hours of hot data on local NVMe and transparently offloads closed segments to S3; a broker’s local footprint stays small, so recovery is fast, while retention is effectively unlimited and cheap. The trade is that a consumer reading far back in history pays an S3 fetch — irrelevant for a real-time desk, acceptable for the occasional compliance replay.
Why KRaft, not ZooKeeper. ZooKeeper was a second distributed system to operate, secure, and recover, and its loss could freeze the cluster. KRaft folds metadata management into a Kafka controller quorum, removing that dependency entirely, shrinking failover time during a broker loss, and simplifying the security surface — one fewer thing for a two-person team to run and for Wiz to have to assess.
Why mTLS everywhere, not just at the edge. In a flat Kubernetes network, any compromised pod can reach a broker port. mTLS means every producer and consumer presents a client certificate the broker verifies, so an unauthorized pod cannot even open a session, and all traffic is encrypted in transit — table stakes for order data. Strimzi runs its own certificate authority, issues per-client certs through the KafkaUser resource, and rotates them automatically before expiry, which removes the classic self-managed-Kafka footgun of a cluster-wide outage when a hand-managed cert silently expires.
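The renewal behavior described above is controlled from the Kafka resource itself. A minimal sketch, assuming the cluster is named trading-bus as elsewhere in this article; the validity and renewal windows shown are illustrative values, not a recommendation from the source:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: trading-bus
spec:
  clusterCa:
    generateCertificateAuthority: true   # let Strimzi own the cluster CA
    validityDays: 365                    # illustrative certificate lifetime
    renewalDays: 30                      # start renewing this many days before expiry
  clientsCa:
    generateCertificateAuthority: true   # CA that signs KafkaUser client certs
    validityDays: 365
    renewalDays: 30
  # kafka: ... (broker settings omitted from this fragment)
```

A longer renewal window gives clients more time to pick up new certificates before the old ones expire, at the cost of more frequent rotation.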
Implementation guidance
Provision the substrate with Terraform, then hand the cluster to GitOps. The split matters: Terraform owns the slow-moving cloud substrate (VPC across three AZs, the EKS cluster, node groups on the right instance families, the S3 tiered-storage bucket, IAM/IRSA roles), and Argo CD owns the fast-moving Kafka definition (the Strimzi CRs), so a topic or broker-config change is a reviewed Git commit that Argo syncs — never a kubectl apply from someone’s laptop.
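On the GitOps side, the hand-off can be sketched as a single Argo CD Application that tracks the directory holding the Strimzi CRs. The repository URL, path, and namespaces below are hypothetical placeholders; only the pattern (automated sync with self-heal, so the repo is the only applier) is taken from the text above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trading-bus-kafka
  namespace: argocd                      # assumed Argo CD install namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/trading-platform.git   # hypothetical repo
    targetRevision: main
    path: kafka/trading-bus              # hypothetical directory of Strimzi CRs
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka                     # assumed namespace for the cluster
  syncPolicy:
    automated:
      prune: true                        # remove resources deleted from Git
      selfHeal: true                     # revert out-of-band kubectl changes
```

With selfHeal enabled, a manual change made from a laptop is reverted on the next reconciliation, which is what keeps the live cluster from drifting.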
A trimmed Kafka custom resource communicates the intent (KRaft, rack-aware, tuned listeners):

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: trading-bus
      annotations:
        strimzi.io/node-pools: enabled
        strimzi.io/kraft: enabled
    spec:
      kafka:
        replicas: 6
        rack:
          topologyKey: topology.kubernetes.io/zone # spread replicas across AZs
        listeners:
          - name: tls
            port: 9093
            type: internal
            tls: true
            authentication:
              type: tls # mTLS for every client
        config:
          default.replication.factor: 3
          min.insync.replicas: 2
          offsets.topic.replication.factor: 3
          replica.selector.class: org.apache.kafka.common.replica.RackAwareReplicaSelector # follower fetch
And the broker node pool, pinned to local-NVMe instances with JBOD storage:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaNodePool
    metadata:
      name: brokers
      labels:
        strimzi.io/cluster: trading-bus
    spec:
      replicas: 6
      roles: [broker]
      storage:
        type: jbod
        volumes:
          - id: 0
            type: ephemeral # local NVMe instance store for the hot commit log
            sizeLimit: 800Gi
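The controller quorum mentioned in the architecture overview lives in its own pool. A sketch of what that pool might look like, assuming three controllers on small persistent volumes; the StorageClass name, the volume size, and the spread constraint are assumptions, not settings from the source:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  labels:
    strimzi.io/cluster: trading-bus
spec:
  replicas: 3                        # one controller per AZ
  roles: [controller]                # metadata quorum only, no partition data
  storage:
    type: persistent-claim           # metadata log should survive pod restarts
    size: 20Gi                       # assumed; the metadata log is small
    class: gp3                       # hypothetical StorageClass name
    deleteClaim: false
  template:
    pod:
      topologySpreadConstraints:     # assumed way of forcing one controller per zone
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              strimzi.io/pool-name: controllers
```

Keeping the controllers on durable volumes while the brokers use instance store means a broker node loss costs only re-replication of the hot window, never cluster metadata.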
The CI that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak, and Wiz Code scans the Terraform and Helm/CR manifests in the pull request — flagging a publicly readable S3 bucket or an over-broad security group before it ever reaches a cluster. Argo CD then reconciles the merged CRs onto EKS.
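A pull-request workflow along these lines might look like the sketch below. The IAM role ARN, EKS cluster name, region, and repository paths are hypothetical, and the Wiz Code scan step is deliberately left out because its invocation is not shown in the source; what the sketch does show is the OIDC federation pattern and a server-side dry run of the CRs.

```yaml
name: validate-kafka-crs
on:
  pull_request:
    paths:
      - "kafka/**"                   # hypothetical directory of Strimzi CRs

permissions:
  id-token: write                    # allows the job to request an OIDC token
  contents: read

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Exchange the GitHub OIDC token for short-lived AWS credentials;
      # no long-lived access key is stored in the repository.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/kafka-ci   # hypothetical role
          aws-region: us-east-1                                     # assumed region

      - name: Point kubectl at the cluster
        run: aws eks update-kubeconfig --name trading-eks --region us-east-1   # hypothetical cluster name

      # Ask the API server to validate the manifests without persisting them.
      - name: Server-side dry run of the Strimzi CRs
        run: kubectl apply --dry-run=server -f kafka/trading-bus/
```

The dry run catches schema errors in a custom resource before merge; the actual apply still belongs to Argo CD.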
Tune for the trading workload, because defaults are not for you. Pin brokers to network-optimized, local-NVMe instances (an i-family or comparable) so the commit log writes to local disk and the NIC has headroom for replication traffic. Set the JVM heap modestly (Kafka leans on the OS page cache, not a huge heap) with G1 pause targets tuned low. Increase num.network.threads and num.io.threads for the fan-out, and align producers and the partition leaders into the same AZ so the hot write path never crosses a zone. Use follower fetching (the RackAwareReplicaSelector above) so consumers read from an in-zone replica and avoid the cross-AZ read hop and its dollar cost. These are the knobs a managed service hides and the reason the firm left.
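Several of these knobs are set directly on the Kafka resource. The fragment below is a sketch with illustrative numbers only: the heap size, pause target, and thread counts are assumptions to be replaced by values measured on the firm's own hardware.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: trading-bus
spec:
  kafka:
    jvmOptions:
      "-Xms": 6g                     # assumed modest heap; leave RAM to the page cache
      "-Xmx": 6g
      "-XX":
        UseG1GC: "true"
        MaxGCPauseMillis: "20"       # assumed low G1 pause target
    config:
      num.network.threads: 8         # assumed; raise for high fan-out
      num.io.threads: 16             # assumed; raise with disk and core count
```

Kernel-level settings such as vm.dirty_ratio and NIC interrupt affinity are not Kafka configuration and would be applied on the node image or node group rather than in this resource.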
Configure tiered storage explicitly per topic. Enable remote storage on the durable topics and set a short local retention with a long remote one, so hot data stays on NVMe and cold data lands in S3. The comments sit on their own lines because a trailing # after a value is read as part of the value, not as a comment:

    remote.storage.enable=true
    # 3 hours hot on local NVMe
    local.retention.ms=10800000
    # ~7 years total, the tail living in S3
    retention.ms=220903200000
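The per-topic settings only take effect once the brokers have a remote storage plugin configured. In Strimzi that is declared under tieredStorage on the Kafka resource. The sketch below uses placeholder values throughout: the class name, class path, bucket name, and the config key are hypothetical and depend entirely on which RemoteStorageManager plugin is baked into the broker image.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: trading-bus
spec:
  kafka:
    tieredStorage:
      type: custom
      remoteStorageManager:
        # Hypothetical plugin class and path; substitute the plugin actually installed.
        className: com.example.kafka.tiered.storage.s3.S3RemoteStorageManager
        classPath: /opt/kafka/plugins/tiered-storage-s3/*
        config:
          # Plugin-specific key, shown only as an illustration.
          storage.bucket.name: trading-bus-tiered
```

Kafka ships the tiered storage framework but not an S3 implementation, so choosing and vetting that plugin is part of the build.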
Enterprise considerations
Security & Zero Trust. The platform is Zero Trust by construction inside the cluster: mTLS authenticates and encrypts every client-broker session, and Kafka ACLs declared through KafkaUser resources enforce least privilege per service. The market-data gateway can Write only to the quote topics, a strategy engine can Read only the feeds it subscribes to, and everything else is denied by default. Layer on top: (a) HashiCorp Vault holds the few real secrets (the MirrorMaker peer-cluster credentials, third-party feed tokens, signing keys), leased dynamically and injected by the Vault Agent sidecar with IRSA-backed auth, so nothing sensitive sits in a plain Kubernetes Secret; (b) Wiz runs continuous CSPM across EKS and S3 and attack-path analysis, alerting the moment the tiered-storage bucket drifts toward public exposure or a node’s security group widens, with Wiz Code as the shift-left check in CI; (c) CrowdStrike Falcon sensors run as a DaemonSet on every broker and operator node for runtime threat detection, feeding the firm’s SOC; (d) human access to Grafana, the EKS console, and the platform tooling federates through Okta (brokered to Entra ID where Azure-side tooling needs a native token), so engineers authenticate once under conditional access rather than sharing a kubeconfig. An SLO breach or a Falcon detection auto-raises a ServiceNow incident so there is a ticket and an audit trail, not just a log line.
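The least-privilege rule for the market-data gateway can be sketched as a KafkaUser. The user name md-gateway and the topic prefix md-quotes. are hypothetical; the sketch also assumes the Kafka resource enables simple authorization, which is what lets the operator manage ACLs for the user.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: md-gateway                   # hypothetical service identity
  labels:
    strimzi.io/cluster: trading-bus
spec:
  authentication:
    type: tls                        # operator issues a client certificate for mTLS
  authorization:
    type: simple
    acls:
      # Write (and the Describe needed to look up metadata) on quote topics only.
      - resource:
          type: topic
          name: md-quotes.           # hypothetical topic prefix
          patternType: prefix
        operations: [Write, Describe]
        host: "*"
```

Because nothing else is granted, the gateway cannot read any topic or write outside the prefix; a strategy engine would get a mirror-image user with Read and Describe on its feeds plus Read on its consumer group.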
Cost optimization. Self-managed is not automatically cheaper; engineer it to be.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Tiered storage | Cold segments to S3 instead of giant local disks | Slashes per-broker storage and shrinks recovery time |
| Right-sized brokers | Pin to the instance family that fits IO + network, no more | Avoids paying for idle CPU on oversized nodes |
| In-AZ traffic alignment | Producers/consumers read-write in-zone via follower fetch | Cuts inter-AZ data-transfer charges, often a top-3 line item |
| acks per topic | acks=1 for ephemeral feeds, acks=all only where durable | Lower replication overhead on the high-volume fan-out |
| Spot for non-critical | MirrorMaker / batch consumers on Spot, brokers on On-Demand | Cheaper control-plane and consumer compute |
The inter-AZ alignment lever is the one teams forget: a chatty cross-zone consumer can quietly make data-transfer the largest bill on the platform, and follower fetching plus AZ-pinned producers is what keeps it down.
Scalability. Each tier scales independently. Brokers scale out by raising replicas on the KafkaNodePool; Strimzi provisions the new broker and you rebalance partitions onto it with Cruise Control (which Strimzi integrates) rather than a hand-built reassignment plan. Topic throughput scales with partition count — size it for the consumer parallelism the strategy engines need, since a partition is the unit of consumer concurrency. EKS node groups scale via the cluster autoscaler. The natural ceiling on a single cluster is metadata and replication overhead, which is why a firm at real scale shards by domain (a market-data cluster separate from the order-journal cluster) rather than one mega-cluster.
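After raising replicas on the node pool, the rebalance onto the new broker can be requested declaratively. A sketch, assuming Cruise Control is enabled on the Kafka resource and that the new broker received ID 6; the actual ID depends on how node IDs were assigned across pools.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: add-broker-6                 # hypothetical name
  labels:
    strimzi.io/cluster: trading-bus
spec:
  mode: add-brokers                  # move replicas onto newly added brokers only
  brokers: [6]                       # assumed ID of the new broker
```

The operator asks Cruise Control for a proposal and records it in the resource status; the move starts once the proposal is approved, typically by annotating the resource with strimzi.io/rebalance=approve. Schedule that approval outside trading hours, since partition movement competes with the hot path for network and disk.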
Failure modes, and what each one looks like. Name them before they page you.
- An AZ goes down mid-session: every partition led from that zone needs a new leader. Because replicas are rack-spread, a surviving in-sync replica in another AZ is elected in under a second; the desk sees a sub-second blip, not a halt. Mitigation: RF=3 across three AZs with min.insync.replicas=2, verified by periodic AZ-failure game days.
- A broker dies: Strimzi reschedules it onto a healthy node, and because tiered storage keeps the local footprint small, re-replication of the hot window finishes in minutes, not hours. Mitigation: tiered storage and Cruise Control-driven rebalancing.
- A cert silently expires: the classic self-managed outage. Mitigation: Strimzi’s CA auto-renews cluster and client certs ahead of expiry; an alert fires if renewal lags.
- Under-replicated partitions climb: replication is falling behind, the early warning of a saturated broker or network. Mitigation: alert on UnderReplicatedPartitions > 0 and on ISR shrink, long before it becomes data risk.
- Consumer lag blows out at the open: a strategy engine cannot keep up, so it acts on stale prices. Mitigation: alert on records-lag-max per consumer group against an SLO, scale partitions and consumer instances, and pre-warm before the open.
- Regional outage: see DR below.
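Two of the mitigations above translate directly into alert rules. The sketch below assumes the Prometheus Operator is installed (hence the PrometheusRule kind) and that the metric names match a typical JMX exporter and Kafka Exporter setup; the exact names depend on the exporter relabeling rules in use, so treat them as assumptions. The lag threshold of 1000 records is the one from the SLO table later in this article.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trading-bus-alerts
  namespace: kafka                   # assumed namespace
spec:
  groups:
    - name: kafka-replication-and-lag
      rules:
        - alert: UnderReplicatedPartitions
          # Assumed metric name from the JMX exporter configuration.
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Partitions are under-replicated on trading-bus"
        - alert: ConsumerLagHigh
          # Assumed metric name from Kafka Exporter; lag is reported broker-side.
          expr: max by (consumergroup, topic) (kafka_consumergroup_lag) > 1000
          for: 30s
          labels:
            severity: critical
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```

The short for windows are a deliberate choice for a trading desk: a lag alert that waits five minutes fires after the open has already been traded on stale prices.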
Reliability & DR (RTO/RPO). Decide the numbers per tier. Within the region, the rack-aware three-AZ layout gives zero loss of acknowledged acks=all writes and sub-second recovery for an AZ failure. For a full regional outage, MirrorMaker 2 asynchronously replicates the durable topics (the order journal above all) to a paired AWS region, syncing both the data and the consumer-group offsets so a failed-over consumer resumes near where it stopped. Async replication means the cross-region RPO is the replication lag, typically seconds, and the RTO is how fast you repoint producers and consumers at the DR cluster, which a runbook plus DNS makes minutes. A pragmatic target for this platform: in-region RTO under one minute and RPO zero; cross-region RTO ~15 minutes and RPO seconds. The order journal’s S3-backed tiered storage is a further backstop: closed segments survive in object storage even if the cluster is lost. Treat rebuilding from them as a manual recovery exercise rather than a failover path, because the metadata that maps those segments to partitions is held by the cluster itself.
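The cross-region leg can be sketched as a KafkaMirrorMaker2 resource deployed next to the DR cluster. Everything environment-specific here is hypothetical: the cluster aliases, bootstrap addresses, and topic pattern. The TLS and authentication stanzas are omitted for brevity, and the field layout follows the v1beta2 schema, which may differ in newer Strimzi releases.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: trading-bus-dr
spec:
  replicas: 2
  connectCluster: dr                 # run the Connect workers beside the target cluster
  clusters:
    - alias: primary
      bootstrapServers: trading-bus.primary.example.internal:9094   # hypothetical address
    - alias: dr
      bootstrapServers: trading-bus-dr-kafka-bootstrap:9093         # hypothetical address
  mirrors:
    - sourceCluster: primary
      targetCluster: dr
      topicsPattern: "order-journal.*"   # hypothetical; durable topics only
      groupsPattern: ".*"
      sourceConnector:
        config:
          replication.factor: 3          # mirrored topics keep three replicas in DR
      checkpointConnector:
        config:
          sync.group.offsets.enabled: "true"   # translate and sync consumer offsets
          emit.checkpoints.interval.seconds: 10
```

Mirroring only the durable topics keeps the cross-region bill and the replication lag down; the ephemeral market-data feed is re-sourced from the exchange in the DR region rather than copied.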
Observability and the SLO contract. This is where self-managing earns its keep, because the whole project is justified by proving latency. The Strimzi JMX exporter exposes every broker metric to Prometheus, and Grafana renders the dashboards the desk and CTO watch. The SLOs are explicit and alerted:
| SLO | Metric | Target |
|---|---|---|
| Producer ack latency (order journal) | producer request-latency p99 | < 5 ms |
| End-to-end feed latency | produce-to-consume p99 | < 8 ms |
| Consumer lag (strategy engines) | records-lag-max | < 1000 records |
| Replication health | UnderReplicatedPartitions | 0 |
| Availability | broker / controller uptime | 99.99% |
Dynatrace sits above the Kafka-native metrics with full-stack distributed tracing and Davis anomaly detection, correlating a producer-side latency spike to a node, a network event, or a GC pause so the two-person team gets a root cause, not just a red graph — and surfacing a regression on its own before it trips the SLO. A sustained SLO breach auto-raises a ServiceNow incident.
Governance. Pin the Kafka and Strimzi versions explicitly and promote upgrades through a staging cluster — Strimzi rolls a version change one broker at a time with controlled shutdown, but you still gate it. Keep every KafkaTopic, KafkaUser, and broker config in Git, reviewed and instantly revertable, with Argo CD as the single applier so the live cluster never drifts from the repo. Route cluster changes through a ServiceNow change approval for an audit trail, and rely on Wiz as the independent check that the posture (no public S3, mTLS on, ACLs enforced) is actually holding.
Explicit tradeoffs
Accept these or do not build it. Self-managing Kafka — even with Strimzi doing the heavy lifting — means you now own broker tuning, capacity planning, partition rebalancing, and the on-call for a stateful distributed system, which a managed service carried for you. Strimzi shrinks that load dramatically but does not erase it: someone still has to understand KRaft quorums, ISR dynamics, and why a partition went under-replicated. Tiered storage adds an S3 dependency and a small fetch-latency cliff for historical reads. The rack-aware, in-AZ-aligned topology that delivers the latency and the data-transfer savings is more design effort than a single-zone cluster, and getting follower fetching and producer placement right is fiddly. And running this on EKS means the platform team owns both Kubernetes and Kafka — two deep systems — which is only worth it because the operator makes the Kafka half tractable for a small team.
The alternatives, and when they win. If latency is not your product and you just need durable streaming, managed Kafka (MSK or Confluent Cloud) is the right default — less to operate, and you give up exactly the tuning this firm needed. If your workload is simple queue-and-fan-out rather than a high-throughput replayable log, a managed message queue (SQS/SNS, or a lighter broker) is simpler than any Kafka at all. If you need stream processing on top — windowed aggregations, joins — add Kafka Streams or Flink as consumers rather than pushing logic into the brokers. And if you are a small team without Kubernetes expertise, running Kafka on dedicated VMs with Ansible trades the EKS learning curve for a more manual operational model — viable, but it gives back much of the self-healing that made Strimzi worth choosing.
The shape of the win
For the trading desk, the payoff is not “we run our own Kafka.” It is that during the open and the close — the moments that decide the firm’s P&L — the produce-to-consume p99 sits under eight milliseconds on a Grafana panel the head of trading can see, an AZ can fail without a single missed fill, the order journal is provably durable and retained for the regulator, and a two-person platform team runs the whole thing because Strimzi handles the broker lifecycle, certificate rotation, and self-healing that used to demand a dedicated ops squad. That combination — managed-service-level operational cost with self-managed-level control over latency — is the one that justifies the build. Everything upstream — the rack-aware AZ spread, the KRaft controllers, the tiered storage to S3, the mTLS and ACLs, the Vault-held secrets, the Wiz posture scanning, the Dynatrace anomaly detection, the MirrorMaker DR — exists so the desk gets its fills and the CTO can prove the SLO is holding. The architecture here is the destination; start with a single durable cluster if you must, but a regulated, latency-critical trading feed is where self-managed Kafka on Kubernetes has to land.