A national parcel carrier — think the kind that moves nine million packages a day across a hub-and-spoke network of sortation centers, line-haul trucks, and last-mile vans — gets a directive from its COO after a peak-season meltdown: the business cannot see its own packages in real time. A scan in a sortation center took up to forty minutes to surface in the customer tracking page, the fraud team learned about a stolen-load pattern from a spreadsheet the next morning, and the data warehouse that finance runs margin on was a day behind reality. Each of these consumers — tracking, fraud, finance, the route-optimization ML team — had been bolted directly onto the sortation system’s database with its own nightly extract, and the database was buckling under the read load while every team blamed every other team. The ask is deceptively simple: “one place every package event lands, and everyone reads from it without stepping on each other.” The constraint is the usual enterprise reality — this runs on AWS across three accounts and two regions, the security team will not allow a managed data service to talk to anything over the public internet, every consumer must be governed and audited, and a malformed event from one upstream team cannot be allowed to poison every downstream reader. This article is the reference architecture for that event backbone, built on Confluent Cloud so the carrier operates streams instead of operating Kafka.
The pressures stack the way they always do in operations-heavy businesses. Volume is non-negotiable: scan, sort, depart, and deliver events run to tens of thousands per second at peak, with a Black-Friday-to-Christmas surge that triples the floor. Decoupling is the actual problem to solve — the failure was that producers and consumers were welded together through a shared database, so every new consumer added load and every schema change broke something. Freshness means tracking and fraud need events in seconds, not the forty minutes that triggered the directive. And governance means every topic, every producer, and every consumer must map to an identity and a contract the security and data teams can audit. An event-streaming backbone — a durable, replayable log that producers append to and many independent consumers read from at their own pace — satisfies all four at once. The log is the source of truth; producers append, consumers subscribe, and neither knows the other exists.
Why not the obvious shortcuts
The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.
Point-to-point integration — wiring each new consumer straight to the sortation database or to each producer — is exactly the mess that caused the meltdown. With N producers and M consumers you trend toward N×M brittle connections, every schema change is a coordinated outage, and the database dies under the combined read load. A plain SQS/SNS fan-out decouples senders from receivers but throws away the two properties that matter most here: it is not a replayable log, so a consumer that was down or a new consumer that joins next quarter cannot rewind and reprocess history, and ordering and partitioned high-throughput semantics are weak. Self-managing Apache Kafka on EC2 or MSK gives you the log, but now the carrier’s small platform team owns broker rebalancing, partition reassignment, version upgrades, Schema Registry, connector runtimes, and 3 a.m. ISR-shrinking pages — undifferentiated heavy lifting for a company whose differentiation is moving boxes, not running distributed consensus.
Confluent Cloud threads the needle. It is Kafka — the durable, partitioned, replayable commit log with strong ordering per partition and consumer-group fan-out — delivered as a managed service, so the carrier gets the log semantics without operating the cluster. Around the raw log it adds the parts every enterprise eventually has to build anyway: a Schema Registry to enforce data contracts, fully managed connectors to land data in S3 and Snowflake without standing up Kafka Connect, ksqlDB / Flink for in-stream processing, and RBAC that binds topic access to corporate identity. The platform team operates streams and contracts; Confluent operates the brokers.
Architecture overview
The backbone runs as a logical event bus that physically lives in Confluent’s AWS account and reaches into the carrier’s VPCs over private networking only. Keeping three planes separate in your head is the first step to operating this well: a produce path where operational systems append events, a process path where streams are joined and enriched in flight, and a consume/sink path where downstream systems and analytics platforms read.
The defining property of the entire topology is the one the security team cares about most: every byte between the carrier’s VPCs and Confluent Cloud rides AWS PrivateLink, and there is no public bootstrap endpoint. The Confluent cluster is a Dedicated cluster exposed into each consuming VPC through a PrivateLink endpoint, so brokers resolve to private IPs inside the carrier’s subnets. No package event, no consumer offset commit, and no Schema Registry call ever touches the public internet — which is what makes the security story defensible across the carrier’s multi-account AWS estate.
Produce path, following the data flow:
- Scanners in sortation centers, handheld devices on vans, and the line-haul telematics system publish to a producer-facing service running on EKS in the operations VPC. That service is the only thing the edge devices talk to; it owns batching, retries, and idempotent produce so a flaky cellular link on a delivery van does not create duplicate scans.
- The producer authenticates to Confluent over PrivateLink using a service-account API key whose secret is never baked into the pod. It is issued and rotated by HashiCorp Vault via the Vault Agent sidecar with Kubernetes auth, so the credential is short-lived and never sits in a Kubernetes Secret or environment variable.
- Before a single byte is written, the producer’s serializer checks the event against the Schema Registry. Each topic —
package.scanned,package.sorted,truck.departed,package.delivered— has a registered Avro schema and a compatibility mode. A producer that tries to send a record violating the contract is rejected at serialization time, in its own process, so a bad release from one upstream team cannot poison the log every other team reads. - The event lands in a topic, partitioned by tracking number so that all events for a given package are ordered and land on the same partition — the property that makes per-package state machines correct downstream.
Process path, in-stream and continuous: a ksqlDB application (graduating to Flink for the heavier joins) reads the raw event topics and does the work that used to require a nightly batch. It maintains a per-package state from the ordered event stream, joins package.scanned against a reference facility table to enrich each scan with region and service-level, and detects the stolen-load pattern the fraud team used to find a day late — a high-value package that scans departed from a hub but never scans arrived at the next, within an SLA window — emitting to a fraud.alert topic in seconds. Derived topics like package.status.latest are themselves first-class topics other consumers subscribe to.
Consume / sink path, each reader independent and at its own pace:
- The customer tracking service (its own EKS deployment, its own consumer group) subscribes to
package.status.latestand serves the tracking page from a fast read store. Because it is a Kafka consumer group, it reads at its own pace and a slow tracking deploy never backpressures the sortation producers. - A fully managed S3 Sink connector continuously lands every raw event into the carrier’s data-lake bucket in Parquet, partitioned by date — the durable, replayable archive of record and the cheap tier for the ML team’s training data.
- A fully managed Snowflake Sink connector streams the same events into Snowflake via Snowpipe Streaming, so finance’s margin models and the analytics team run on data that is seconds old instead of a day behind — the exact gap that put this project on the COO’s desk.
- The route-optimization ML team runs its own consumer group off the enriched topics, with no coordination needed and zero load added to any operational database.
Component breakdown
| Component | Service / tool | Role in the backbone | Key configuration choices |
|---|---|---|---|
| Streaming platform | Confluent Cloud (Dedicated) | Managed Kafka log: durable, partitioned, replayable event bus | Dedicated cluster (CKUs sized to throughput); multi-AZ; tiered storage on |
| Private networking | AWS PrivateLink | Private-only reach from each VPC to the cluster | PrivateLink endpoint per consuming VPC; no public bootstrap; Route 53 private hosted zone |
| Data contracts | Schema Registry | Enforce Avro schemas + compatibility per topic | BACKWARD compatibility default; broker-side validation; schema IDs in records |
| Producers | EKS producer services | Idempotent, batched ingest from edge devices | enable.idempotence=true; acks=all; partition by tracking number |
| Stream processing | ksqlDB / Confluent Flink | In-stream enrich, per-package state, fraud detection | Stateful joins; tumbling/hopping windows; derived topics |
| Lake sink | Managed S3 Sink connector | Land all events to the data lake as Parquet | Parquet format; time-based partitioning; exactly-once delivery |
| Warehouse sink | Managed Snowflake Sink connector | Stream events into Snowflake for analytics/finance | Snowpipe Streaming; schema-evolution on; per-topic table mapping |
| Identity / SSO | Okta + Confluent RBAC | Workforce SSO and identity-bound topic authorization | SSO via Okta OIDC; group-to-role mapping; service accounts for apps |
| Secrets | HashiCorp Vault | Issue/rotate Confluent API keys and connector creds | Kubernetes auth; dynamic short-lived keys; Vault Agent sidecar |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, exposure, IaC scanning of the streaming estate | Agentless scan of VPCs/S3/IAM; Wiz Code gates Terraform PRs |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on EKS nodes and producer/sink compute | Falcon sensor on node groups; detections to the SOC |
| Observability | Datadog | Cluster, consumer-lag, and connector telemetry; tracing | Confluent Cloud integration; lag monitors; APM on EKS consumers |
| ITSM / change | ServiceNow | Topic onboarding approvals, schema-change CRs, incident records | Change gate before a new topic/connector ships; auto-ticket on lag breach |
| CI / IaC | GitHub Actions + Terraform | Provision cluster, topics, RBAC, connectors as code | OIDC to AWS + Confluent provider; eval/lint gate before apply |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why partition by tracking number, not round-robin. Kafka guarantees ordering only within a partition, not across the topic. If package.scanned events were spread round-robin, the per-package state machine in ksqlDB could see “delivered” before “out for delivery” and compute nonsense. Keying every record by tracking number guarantees all events for one package land on one partition in order, which is what makes “current status” correct and the fraud window logic sound. The cost is the standard one — a single mega-customer’s tracking number could create a hot partition — mitigated by the fact that tracking numbers are naturally high-cardinality, so the load spreads evenly.
Why the Schema Registry is the contract, not documentation. The original meltdown’s root cause was implicit contracts: producers and consumers agreed on a shape by convention, and a convention drifts. Registering an Avro schema per topic with a compatibility mode turns the contract into something the platform enforces. With BACKWARD compatibility, a producer may add an optional field but cannot rename or drop one that consumers depend on, and the registry refuses the incompatible schema at registration. The data contract stops being a wiki page someone forgot to update and becomes a gate in the pipeline:
# A producer's serializer validates against the registered subject before producing.
# An incompatible change is rejected here, in CI, not discovered in prod by a broken consumer.
$ curl -s "$SR_URL/compatibility/subjects/package.scanned-value/versions/latest" \
--data @new-schema.json -H "Content-Type: application/json"
{"is_compatible": false} # CI fails the PR; the contract holds
Why managed connectors instead of self-run Kafka Connect. Landing to S3 and Snowflake reliably with exactly-once semantics, offset management, and schema evolution is a service in its own right. Running it yourself means a Connect cluster to size, patch, and page on. Confluent’s fully managed S3 and Snowflake Sink connectors make that someone else’s on-call: you declare the connector in Terraform, point it at the topics, and Confluent runs the workers, handles backpressure, and guarantees delivery. The tradeoff is less low-level control and a per-connector cost — worth it for a platform team that should be building data contracts, not operating connector runtimes.
Implementation guidance
Provision with Terraform, and treat the network as the first deliverable. The deployment order matters because of private DNS — get it wrong and the bootstrap endpoint resolves to nothing and every client hangs on connect.
- The Dedicated Confluent cluster in the carrier’s AWS region, sized in CKUs to peak throughput, multi-AZ, with tiered storage enabled so long retention is cheap.
- A PrivateLink connection from Confluent into each consuming VPC, with a Route 53 private hosted zone so the bootstrap and broker hostnames resolve to the private endpoint inside every account. Forgetting the private hosted zone in one VPC is the single most common failure on this architecture — the cluster is up, but that account’s clients time out.
- Topics with explicit partition counts and retention, and their schemas registered with compatibility modes — both as code, reviewed in pull requests.
- Service accounts and RBAC role bindings per application, scoped to exactly the topics each one needs.
- The managed connectors to S3 and Snowflake, declared and versioned alongside everything else.
A minimal Terraform shape communicates the intent — a dedicated cluster, a contract-bound topic, and least-privilege access:
resource "confluent_kafka_cluster" "backbone" {
display_name = "parcel-event-backbone-prod"
availability = "MULTI_ZONE"
cloud = "AWS"
region = "us-east-1"
dedicated { cku = 4 } # sized to peak; scale CKUs for surge
environment { id = confluent_environment.prod.id }
}
resource "confluent_kafka_topic" "package_scanned" {
topic_name = "package.scanned"
partitions_count = 48 # headroom for consumer parallelism
kafka_cluster { id = confluent_kafka_cluster.backbone.id }
config = { "retention.ms" = "604800000" } # 7-day replay window
}
# Tracking service reads only what it needs — least privilege, by identity.
resource "confluent_role_binding" "tracking_reader" {
principal = "User:${confluent_service_account.tracking.id}"
role_name = "DeveloperRead"
crn_pattern = "${confluent_kafka_cluster.backbone.rbac_crn}/kafka=${confluent_kafka_cluster.backbone.id}/topic=package.status.latest"
}
The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation and to Confluent via a scoped API key from Vault, so there is no long-lived cloud secret stored in the CI system — a hard lesson the platform team intends never to repeat. The same pipeline runs schema-compatibility checks and a Terraform plan review as required gates, with Argo CD reconciling the EKS-side producer and consumer deployments from Git so the application layer is as declarative as the infrastructure.
Identity: bind every topic to an identity. Two distinct flows exist. Human access federates through Okta as the workforce IdP into Confluent via OIDC SSO, and Okta group membership maps to Confluent RBAC roles — a data engineer in the kafka-platform group gets ClusterAdmin on the non-prod environment and read-only on prod, while a fraud analyst gets DeveloperRead on fraud.alert and nothing else. Application access uses service accounts, one per producer or consumer, each with RBAC bindings scoped to precisely the topics it touches — the tracking service can read package.status.latest and write nothing; the sortation producer can write package.scanned and read nothing. The API keys those service accounts authenticate with are issued and rotated by HashiCorp Vault, leased short and injected by the Vault Agent sidecar, so a leaked key is both narrowly scoped and quickly stale.
Schema-change discipline. Treat schemas as the most important code in the repo. Default every subject to BACKWARD compatibility; require additive-only changes; run the registry’s compatibility check in CI so an incompatible change fails the pull request, never a consumer at 2 a.m. When a genuinely breaking change is unavoidable, version the topic (package.scanned.v2) and migrate consumers deliberately rather than mutating a live contract.
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: private-only networking with no public broker surface, identity-bound RBAC on every topic, and least-privilege service accounts. Layer on top: (a) Wiz running continuous CSPM across the carrier’s AWS accounts — flagging any S3 sink bucket that drifts to public, any over-broad IAM role, any security-group hole into the streaming subnets — with Wiz Code scanning the Terraform in pull requests so a misconfiguration is caught before it is ever applied; (b) CrowdStrike Falcon sensors on the EKS node groups running producers, consumers, and ksqlDB workloads, feeding runtime detections to the carrier’s SOC; © encryption in transit over PrivateLink/TLS and at rest in the cluster and in S3, with the option of customer-managed keys for the regulated bits; (d) a lag breach or a rejected-schema spike auto-raising a ServiceNow incident so the platform team gets a ticket, not just a Datadog blip. Producers and connectors authenticate as principals, never as a shared cluster-wide key, so a compromised van-device service cannot read the fraud topic.
Cost optimization. Streaming cost is driven by cluster capacity, throughput, storage, and connectors, and it grows with adoption — so engineer for it from day one.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Tiered storage | Offload old segments to object storage; keep brokers lean | Long retention at a fraction of broker-disk cost |
| Right-size CKUs + surge | Run steady-state CKUs; scale up for peak season, back down after | Avoids paying peak capacity year-round |
| Compression | compression.type=zstd on producers |
Cuts throughput-based and storage cost materially |
| Sink to S3 for cold reads | Replay/history from the lake, not from a fat broker retention | Shrinks the expensive in-cluster retention window |
| One backbone, many readers | Reuse topics across consumers instead of per-consumer extracts | Removes the N pipelines the old architecture paid for |
The single biggest saving is structural: the old world ran a separate nightly extract per consumer, each with its own compute and its own copy. One governed backbone with many consumer groups collapses all of that into a single produce-once, read-many log. Meter throughput and storage in Datadog and pipe it to the chargeback dashboard the CFO sees, attributed per topic and per consuming team.
Scalability. Each plane scales independently. Cluster capacity scales by adding CKUs for the peak surge and shrinking after. Per-topic parallelism scales with partition count — size partitions to the maximum consumer parallelism you will ever want, because increasing partitions later changes key-to-partition mapping and disrupts ordering. Consumers scale horizontally by adding instances to a consumer group, up to the partition count. ksqlDB/Flink scales by adding streaming units, and the managed connectors scale their own task count. The natural ceiling to watch is hot partitions: a poor key choice, not raw volume, is what usually caps throughput.
Failure modes, and what each one looks like. Name them before they page you.
- A missing PrivateLink / private hosted zone in one account — the cluster is healthy but that VPC’s clients cannot resolve the bootstrap endpoint and hang on connect. Mitigation: assert the PrivateLink endpoint and Route 53 zone per account in Terraform, plus a post-deploy connectivity smoke test from each VPC.
- A poison message / bad schema — a producer bug emits records a consumer cannot deserialize, stalling that consumer group. Mitigation: Schema Registry rejects most of these at produce time; for the rest, a dead-letter topic on consumers and connectors so one bad record is quarantined, not a stop-the-world block.
- Consumer lag blowout — a downstream reader (or a sink connector) falls behind during the peak surge and the tracking page goes stale. Mitigation: Datadog consumer-lag monitors with paging thresholds, autoscaling on the EKS consumers, and enough partitions to add parallelism.
- Hot partition — a skewed key concentrates load on one broker/partition and throughput plateaus. Mitigation: high-cardinality keys (tracking number), and a documented re-keying playbook.
- Regional outage — see DR below.
Reliability & DR (RTO/RPO). Within a region, the Dedicated cluster is multi-AZ, so a single Availability Zone failure is transparent — replicas in other zones carry the load. For region loss, use Confluent Cluster Linking to mirror critical topics to a second region’s cluster, with consumers able to fail over and resume from mirrored offsets, and S3 (cross-region replication on the lake bucket) as the durable, replayable source of truth that can rebuild a region’s derived state. A pragmatic target for this backbone: RTO 30 minutes, RPO near zero for the core event topics via Cluster Linking, with the full history rebuildable from geo-replicated S3 if a downstream store is lost entirely. Decide these numbers per consumer — tracking needs fast failover; the ML training feed can tolerate a slower rebuild from the lake.
Observability. Instrument the backbone end to end in Datadog via the Confluent Cloud integration: cluster health, throughput, partition and broker metrics, and — the metric that actually predicts incidents — consumer-group lag per consumer and per connector, with paging thresholds. Add APM tracing on the EKS producers and consumers so a slow downstream is traceable to its hop, and emit the business-facing metrics that matter here: scan-to-tracking-page latency, fraud-alert detection latency, end-to-end freshness into Snowflake, and dead-letter rate. New topics and new connectors pass through a ServiceNow change approval before going live, giving the data-governance team a documented gate and an inventory of who produces and consumes what.
Governance. The Schema Registry plus RBAC is the governance layer, but make it explicit: maintain a catalog of every topic, its schema, its owning team, and its consumers — much of it derivable from the Confluent metadata, surfaced for the data office. Pin connector and schema versions in Terraform so nothing drifts; promote schema changes through the CI compatibility gate; tag topics carrying personal data (a recipient’s address rides package.delivered) so retention and a right-to-be-forgotten path are deliberate, since recipient data is personal data under the same regimes the carrier already answers to. Every RBAC binding and every schema change is in Git, reviewable and revertable.
Explicit tradeoffs
Accept these or do not build it. An event backbone adds real moving parts the point-to-point world did not have — schemas to govern, partitions to size correctly the first time, consumer lag to watch, and a streaming mental model the whole organization has to learn. Ordering is only guaranteed within a partition, so your keying strategy is a correctness decision, not a tuning knob, and getting it wrong is subtle and expensive to unwind. Going managed with Confluent Cloud trades raw control and the lowest-possible infrastructure bill for not operating Kafka — you accept a per-CKU and per-connector cost and a vendor relationship in exchange for handing broker rebalancing, upgrades, and connector runtimes to someone else. The private-networking posture that makes the security team sign costs you setup complexity — a PrivateLink endpoint and a private hosted zone per VPC, no public debugging shortcut — and the price of forgetting one piece is a silent connect hang, not a clear error.
The alternatives, and when they win. If your event volume is modest and you never need replay or strong ordering, SQS/SNS is simpler and cheaper and you should use it. If you have a large, skilled platform team and squeezing the infrastructure bill is worth real operational toil, self-managed Kafka or MSK gives you maximum control — and MSK plus the open-source ecosystem is a reasonable middle ground if you want AWS-native billing and can run Schema Registry and Connect yourself. If your need is a few services emitting occasional domain events rather than a high-throughput operational firehose, EventBridge with its schema registry and AWS-native routing may be the better fit. Confluent Cloud is the destination when the requirement is a governed, high-throughput, replayable backbone with first-class contracts and managed connectors and the team’s time is better spent on data products than on cluster operations.
The shape of the win
For the carrier, the payoff is not “we run Kafka now.” It is that a scan in a sortation center shows up on the customer’s tracking page in a couple of seconds, the fraud team gets a departed-but-never-arrived alert while the truck is still on the road instead of from a spreadsheet the next morning, finance runs margin on Snowflake data that is seconds old, and the ML team spins up a brand-new consumer next quarter without adding a single byte of load to any operational database or asking another team’s permission. Every team produces once and reads many — the exact inversion of the welded-together database that melted down at peak. Everything upstream — the PrivateLink-only networking, the Okta-federated RBAC, the Vault-issued keys, the Schema Registry contracts, the Wiz posture scanning, the Datadog lag monitors — exists to make the COO, the CISO, and the data office each say yes. The architecture here is the destination; start with the handful of topics that hurt most, but this is where a high-volume, governed “single source of event truth” has to land.