A payments company runs its order and settlement events through Kafka on Confluent Cloud in us-east-1. The mandate from the new platform lead is blunt: a regional outage cannot stop European settlement, and auditors want seven years of every settlement.completed event re-readable on demand — not in a cold archive nobody can replay, but as a live Kafka topic a consumer group can rewind into. The current setup keeps seven days of retention on local broker disk and replicates nothing across regions. This guide builds the fix: Cluster Linking to mirror topics from a us-east-1 source cluster into a eu-west-1 destination cluster with byte-for-byte offset fidelity, and Tiered Storage so both clusters offload old log segments to object storage and keep months-to-years of history without paying for hot broker disk. Everything below is real confluent CLI, Terraform, and config — run it top to bottom and you have a working multi-region, long-retention Kafka.
Prerequisites
- Two Confluent Cloud Dedicated clusters in the same environment, in different cloud regions (here: us-east-1 source, eu-west-1 destination). Cluster Linking requires Dedicated; Tiered Storage is on by default for Dedicated and Enterprise clusters, but retention is what you tune.
- The Confluent CLI v3.x+ installed and logged in (confluent login), plus Terraform ≥ 1.6 with the confluentinc/confluent provider ≥ 1.60.
- A Confluent Cloud API key/secret with OrganizationAdmin or EnvironmentAdmin on the environment, used only to bootstrap; runtime service accounts get scoped keys.
- Okta (or Microsoft Entra ID) as your workforce IdP, federated to Confluent Cloud via SSO so human operators authenticate with corporate identity and MFA, with no shared logins to the Confluent Console.
- HashiCorp Vault reachable from your CI runners, to store the Confluent API key/secret and the per-service cluster credentials as short-lived leases rather than plaintext in a pipeline.
- A schema strategy decided up front: if you use Schema Registry, the destination region needs its schemas too (Schema Linking or a shared Stream Governance package).
Target topology
The shape is two Dedicated clusters and a one-way link between them. The source cluster in us-east-1 owns the writeable topics, and producers only ever write here. A Cluster Link named payments-east-to-west runs as a managed component on the destination cluster in eu-west-1 and continuously pulls a configured set of topics into read-only mirror topics that preserve partition count, keys, and exact offsets. Each cluster independently runs Tiered Storage: recent log segments stay on broker disk for low-latency reads, and segments older than a small hotset threshold are offloaded to cloud object storage (S3/GCS/Azure Blob, managed by Confluent), so a topic can hold years of data while the brokers hold only the recent hotset. Consumers in Europe read the mirror topics. If us-east-1 is lost, you convert the mirrors to writeable topics (with the mirror failover command when the source is unreachable, or promote when it is still reachable) and fail producers over; because offsets match and committed consumer offsets are synced, consumer groups resume close to where they were. Around the data plane sit the operating tools: Terraform declares clusters/links/topics, Vault issues credentials, Okta/Entra gates humans, Jenkins/GitHub Actions plus Argo CD ship the config, Dynatrace/Datadog watch lag, ServiceNow records the failover change, and Wiz audits the cloud posture.
1. Authenticate, pin the environment, and stage credentials in Vault
Log in (SSO-backed, so the browser hands you off to Okta/Entra for MFA), then pin the environment so every later command is unambiguous.
confluent login --save # SSO -> Okta/Entra, MFA, token cached
confluent environment list
confluent environment use env-payments-prod # pin so we never act on the wrong env
Create a bootstrap Cloud API key and store it in Vault immediately — never leave it in shell history or CI logs. Vault holds it as a versioned secret that CI reads with a short-lived token, so the static Confluent secret is fetched at job time, not baked into a pipeline definition.
confluent api-key create --resource cloud --description "tf-bootstrap-payments"
# -> prints API Key + Secret ONCE. Pipe straight into Vault, do not echo to a file.
vault kv put secret/confluent/payments-bootstrap \
api_key="<KEY>" api_secret="<SECRET>"
In Terraform, the provider then reads those values from Vault rather than from terraform.tfvars:
data "vault_kv_secret_v2" "cc_bootstrap" {
  mount = "secret"
  name  = "confluent/payments-bootstrap"
}

provider "confluent" {
  cloud_api_key    = data.vault_kv_secret_v2.cc_bootstrap.data["api_key"]
  cloud_api_secret = data.vault_kv_secret_v2.cc_bootstrap.data["api_secret"]
}
2. Declare both Dedicated clusters in Terraform
Cluster Linking needs Dedicated clusters. Declare both regions and the environment in one config so the topology is reviewable and reproducible — this is the artifact Jenkins/GitHub Actions runs and Argo CD keeps in sync with the live state.
resource "confluent_environment" "payments" {
  display_name = "payments-prod"
  stream_governance { package = "ADVANCED" } # Schema Registry for both regions
}

resource "confluent_kafka_cluster" "source_us" {
  display_name = "payments-source-use1"
  availability = "MULTI_ZONE"
  cloud        = "AWS"
  region       = "us-east-1"
  dedicated { cku = 2 }
  environment { id = confluent_environment.payments.id }
}

resource "confluent_kafka_cluster" "dest_eu" {
  display_name = "payments-dest-euw1"
  availability = "MULTI_ZONE"
  cloud        = "AWS"
  region       = "eu-west-1"
  dedicated { cku = 2 }
  environment { id = confluent_environment.payments.id }
}
Apply through CI, not from a laptop:
terraform init && terraform plan -out tfplan
terraform apply tfplan # provisioning Dedicated CKUs takes a few minutes
3. Create service accounts and scoped, Vault-backed cluster credentials
Give producers, consumers, and the link their own identities — never reuse the bootstrap key. Each service account gets a cluster-scoped API key whose secret lands in Vault under a per-service path, so a leaked consumer key can be rotated without touching producers.
# identities
confluent iam service-account create sa-payments-producer --description "EAST producers"
confluent iam service-account create sa-payments-link --description "Cluster Link principal"
# scoped key for the producer, against the SOURCE cluster
confluent api-key create --service-account sa-payments-producer \
--resource lkc-source-use1
vault kv put secret/confluent/payments-producer api_key="<KEY>" api_secret="<SECRET>"
# least-privilege ACLs on the source: producers may only WRITE the payments topics
confluent kafka acl create --cluster lkc-source-use1 \
--allow --service-account sa-payments-producer \
--operations WRITE,DESCRIBE --topic settlement --prefix
The link’s own principal needs read access on the source so it can mirror; grant it READ/DESCRIBE on the topic prefix and DESCRIBE on the cluster. Keep this principal distinct so the audit trail shows exactly what the link copied.
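The grants described above might look like the following sketch. It assumes the link's service-account ID is supplied in LINK_SA (the default shown is only this guide's placeholder name, not a real sa-... ID) and that the cluster-level grant is expressed with the CLI's --cluster-scope flag. The run wrapper is a hypothetical helper that prints the commands unless DRY_RUN=0, so the output can be reviewed before anything changes.

```shell
#!/bin/sh
# Sketch: source-side ACLs for the link principal
# (READ/DESCRIBE on the topic prefix, DESCRIBE on the cluster).
SOURCE_CLUSTER="${SOURCE_CLUSTER:-lkc-source-use1}"  # placeholder ID used in this guide
LINK_SA="${LINK_SA:-sa-payments-link}"               # assumption: replace with the real sa-... ID
DRY_RUN="${DRY_RUN:-1}"                              # 1 = print only, 0 = execute

run() {
  # Echo the command in dry-run mode; execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

grant_link_acls() {
  # Topic-level read access on everything under the "settlement" prefix.
  run confluent kafka acl create --cluster "$SOURCE_CLUSTER" \
    --allow --service-account "$LINK_SA" \
    --operations READ,DESCRIBE --topic settlement --prefix
  # Cluster-level DESCRIBE so the link can inspect the source cluster.
  run confluent kafka acl create --cluster "$SOURCE_CLUSTER" \
    --allow --service-account "$LINK_SA" \
    --operations DESCRIBE --cluster-scope
}

grant_link_acls
```

Depending on which link features are enabled, the link principal may need further grants (for example on consumer groups for offset sync); check the Cluster Linking security documentation before relying on this minimal set.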
4. Turn on Tiered Storage retention on the source topics
Dedicated clusters already tier segments to object storage; what you control per topic is how long data is retained and the hotset that stays on local disk. Create the topics with long retention.ms and a modest local hotset so brokers stay lean while the topic holds years.
# 7-year retention, but only ~last 6h of segments kept on broker disk (rest in object store)
confluent kafka topic create settlement.completed \
--cluster lkc-source-use1 \
--partitions 12 \
--config retention.ms=220752000000 \
--config confluent.tier.local.hotset.ms=21600000 \
--config cleanup.policy=delete
confluent kafka topic create order.events \
--cluster lkc-source-use1 --partitions 12 \
--config retention.ms=220752000000 \
--config confluent.tier.local.hotset.ms=21600000
The same in Terraform so it is declarative and drift-controlled:
resource "confluent_kafka_topic" "settlement" {
  kafka_cluster { id = confluent_kafka_cluster.source_us.id }
  topic_name       = "settlement.completed"
  partitions_count = 12
  config = {
    "retention.ms"                   = "220752000000" # ~7 years
    "confluent.tier.local.hotset.ms" = "21600000"     # ~6h hot on disk
    "cleanup.policy"                 = "delete"
  }
  rest_endpoint = confluent_kafka_cluster.source_us.rest_endpoint
  # Topic management needs a Kafka (cluster-scoped) API key, not the Cloud bootstrap key.
  # cc_admin is assumed to be a second Vault data source, declared like cc_bootstrap in Step 1.
  credentials {
    key    = data.vault_kv_secret_v2.cc_admin.data["api_key"]
    secret = data.vault_kv_secret_v2.cc_admin.data["api_secret"]
  }
}
confluent.tier.local.hotset.ms is the lever that makes long retention cheap: it caps disk-resident data regardless of retention.ms. Reads of older offsets transparently fetch from object storage with slightly higher latency — exactly the right trade for replay and audit.
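The millisecond values used above are easy to mistype, so it can help to derive them rather than hard-code them. A minimal sketch, assuming a "year" means 365 days (leap days ignored); the helper names are illustrative only:

```shell
#!/bin/sh
# Convert human-friendly durations to the millisecond values used in topic configs.
days_to_ms()  { echo $(( $1 * 24 * 60 * 60 * 1000 )); }
hours_to_ms() { echo $(( $1 * 60 * 60 * 1000 )); }

RETENTION_MS="$(days_to_ms $(( 7 * 365 )))"  # 7 years of 365 days
HOTSET_MS="$(hours_to_ms 6)"                 # 6 hours kept on broker disk

echo "retention.ms=$RETENTION_MS"                   # retention.ms=220752000000
echo "confluent.tier.local.hotset.ms=$HOTSET_MS"    # confluent.tier.local.hotset.ms=21600000
```

The two printed values match the ones passed to the topic create commands in this step.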
5. Create the Cluster Link from destination, pulling the source
Cluster Linking is destination-initiated: you create the link object on the eu-west-1 cluster and point it at us-east-1. The link runs as managed Confluent infrastructure — there is no Connect cluster or MirrorMaker to babysit.
First write a config file giving the link the source’s bootstrap and a source-scoped key (fetched from Vault) so it can authenticate back to us-east-1:
# link-config.properties
link.mode=DESTINATION
connection.mode=OUTBOUND
bootstrap.servers=<SOURCE_BOOTSTRAP>:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule \
required username="<LINK_KEY>" password="<LINK_SECRET>";
consumer.offset.sync.enable=true
auto.create.mirror.topics.enable=false
Then create the link, naming source and destination cluster IDs explicitly:
confluent kafka link create payments-east-to-west \
  --cluster lkc-dest-euw1 \
  --source-cluster lkc-source-use1 \
  --source-bootstrap-server <SOURCE_BOOTSTRAP>:9092 \
  --config link-config.properties   # CLI v3: --config takes a file path or key=value pairs
confluent kafka link list --cluster lkc-dest-euw1 # expect the link in 'ACTIVE'
consumer.offset.sync.enable=true is the setting that makes failover clean: it copies committed consumer-group offsets from source to destination, so a group that failed over resumes at the right place instead of from zero or from latest.
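Because the link config carries a secret, one option is to render the file at job time instead of keeping it in git. The following is a sketch only: render_link_config is a hypothetical helper, and SOURCE_BOOTSTRAP, LINK_KEY and LINK_SECRET are assumed to be exported by the CI job after it reads them from Vault.

```shell
#!/bin/sh
# Sketch: render link-config.properties from values supplied at job time.
render_link_config() {
  # $1 = source bootstrap host:port, $2 = API key, $3 = API secret
  cat <<EOF
link.mode=DESTINATION
connection.mode=OUTBOUND
bootstrap.servers=$1
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$2" password="$3";
consumer.offset.sync.enable=true
auto.create.mirror.topics.enable=false
EOF
}

# Usage (commented out so the sketch has no side effects):
# umask 077   # owner-only permissions for the rendered file
# render_link_config "$SOURCE_BOOTSTRAP" "$LINK_KEY" "$LINK_SECRET" > link-config.properties
```

Deleting the rendered file once the link exists keeps the secret out of the runner's workspace.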
6. Create mirror topics on the destination
Mirror topics are read-only copies that inherit the source’s partition count and exact offsets. Create one per topic you want in Europe, over the link you just made.
confluent kafka mirror create settlement.completed \
--cluster lkc-dest-euw1 --link payments-east-to-west
confluent kafka mirror create order.events \
--cluster lkc-dest-euw1 --link payments-east-to-west
confluent kafka mirror list --cluster lkc-dest-euw1 --link payments-east-to-west
Set the destination topics’ own Tiered Storage retention so Europe keeps the same seven-year horizon locally (mirror topics tier independently of the source):
confluent kafka topic update settlement.completed --cluster lkc-dest-euw1 \
--config retention.ms=220752000000 \
--config confluent.tier.local.hotset.ms=21600000
Now eu-west-1 consumers read settlement.completed from the destination at low latency, while writes still happen only in us-east-1.
7. Wire the pipeline, identity, and observability
Make all of the above repeatable and watched.
- CI/CD: the Terraform for clusters, topics, links, and ACLs lives in git and is applied by Jenkins or GitHub Actions on merge; Argo CD reconciles the desired Confluent state continuously so manual console drift is reverted. The pipeline pulls Confluent and cluster credentials from Vault at job time via a short-lived token — no static secret in the repo.
- Human access: operators reach the Confluent Cloud Console and CLI only through Okta (or Entra ID) SSO with MFA; RBAC role bindings (EnvironmentAdmin, Operator) map to IdP groups, so off-boarding in the IdP instantly removes Confluent access.
- Observability: export Confluent’s Metrics API into Datadog or Dynatrace and alert on cluster_link_mirror_topic_offset_lag (replication lag) and tiered-storage fetch latency. A sustained lag spike or a link dropping out of ACTIVE should page on-call and auto-open a ServiceNow incident.
- Posture: Wiz audits the underlying cloud account and the object-storage buckets Tiered Storage uses, flagging any public exposure or encryption misconfiguration on the tiered data.
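The replication-lag alert from the Observability item can be prototyped as a small threshold check before it is encoded in Datadog or Dynatrace. This is a sketch under assumptions: the per-partition lag values are passed in as arguments (how they are fetched from the Metrics API is left to the caller), and the threshold of 1000 messages is purely illustrative.

```shell
#!/bin/sh
# Sketch: decide whether mirror lag should page on-call.
max_lag() {
  # Print the largest of the per-partition lag values given as arguments (0 if none).
  max=0
  for lag in "$@"; do
    if [ "$lag" -gt "$max" ]; then max="$lag"; fi
  done
  echo "$max"
}

lag_status() {
  # $1 = threshold in messages; remaining args = per-partition lags
  threshold="$1"; shift
  if [ "$(max_lag "$@")" -gt "$threshold" ]; then echo "ALERT"; else echo "OK"; fi
}

lag_status 1000 0 12 3      # prints OK
lag_status 1000 0 4200 3    # prints ALERT
```

Alerting on the maximum rather than the average keeps a single stalled partition from hiding behind healthy ones.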
Validation
Prove the three things that matter: data flows, offsets match, and old data really comes from object storage.
# 1) produce on the SOURCE
echo '{"id":"s-1001","amt":42.50}' | \
confluent kafka topic produce settlement.completed --cluster lkc-source-use1
# 2) consume the MIRROR on the DESTINATION (should appear within seconds)
confluent kafka topic consume settlement.completed \
--cluster lkc-dest-euw1 --from-beginning --print-key
# 3) link health + per-partition lag should be near zero
confluent kafka mirror describe settlement.completed \
--cluster lkc-dest-euw1 --link payments-east-to-west
Confirm tiering is actually offloading by reading from an offset older than the hotset and watching the Metrics API report a non-zero tiered-bytes-fetched for that topic. Verify offset fidelity by comparing the high-water marks on source and mirror — they should track partition-for-partition. Finally, validate offset sync: commit a consumer group on the source, then describe that group on the destination and confirm its offsets were synced.
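The high-water-mark comparison can be scripted once the offsets have been collected. A minimal sketch, assuming each side has been dumped to a text file with one "partition offset" pair per line (how the dump is produced is not covered here, and compare_hwm is a hypothetical helper). Run it while producers are quiet, since a live mirror normally trails the source slightly.

```shell
#!/bin/sh
# Sketch: compare per-partition high-water marks of source and mirror.
compare_hwm() {
  # $1 = source file, $2 = mirror file; each line is "<partition> <offset>".
  # Prints every mismatch and returns non-zero if any partition differs.
  awk 'BEGIN { bad = 0 }
       NR == FNR { src[$1] = $2; next }
       { seen[$1] = 1
         if (src[$1] != $2) {
           printf "partition %s: source=%s mirror=%s\n", $1, src[$1], $2; bad = 1 } }
       END { for (p in src) if (!(p in seen)) {
               printf "partition %s: missing on mirror\n", p; bad = 1 }
             exit bad }' "$1" "$2"
}

# Demo with made-up offsets: partition 1 differs between the two sides.
src_file="$(mktemp)"; dst_file="$(mktemp)"
printf '0 1500\n1 1498\n' > "$src_file"
printf '0 1500\n1 1490\n' > "$dst_file"
compare_hwm "$src_file" "$dst_file" || echo "offsets differ"
rm -f "$src_file" "$dst_file"
```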
Rollback / teardown
Tear down in reverse dependency order so nothing is left orphaned (and so a half-deleted link does not block re-creation). Mirror topics must be stopped/deleted before the link.
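# delete the mirror topics outright (to keep the data instead, promote them first
# so they become regular writeable topics); a mirror topic is deleted like any other topic
confluent kafka topic delete settlement.completed --cluster lkc-dest-euw1
confluent kafka topic delete order.events --cluster lkc-dest-euw1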
# delete the link
confluent kafka link delete payments-east-to-west --cluster lkc-dest-euw1
# if decommissioning, destroy via Terraform so state stays truthful
terraform destroy -target=confluent_kafka_cluster.dest_eu
For a planned failover (not teardown), do not delete — promote: confluent kafka mirror promote settlement.completed --cluster lkc-dest-euw1 --link payments-east-to-west stops mirroring, makes the topic writeable, and lets you repoint producers at eu-west-1. Record the promotion as a change in ServiceNow before you run it.
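Promotion needs the source to be reachable so the mirror can catch up first; when the source region is down, the CLI's mirror failover subcommand is the one to use instead. A runbook helper could encode that choice and print the exact command for the change record. This is a sketch: the function names are hypothetical, and how the source is judged "reachable" is left to the operator.

```shell
#!/bin/sh
# Sketch: choose the mirror command for a failover and print it for the change record.
DEST_CLUSTER="${DEST_CLUSTER:-lkc-dest-euw1}"        # placeholder ID used in this guide
LINK_NAME="${LINK_NAME:-payments-east-to-west}"

mirror_action() {
  # $1 = "reachable" if the source cluster still answers; anything else means it does not.
  if [ "$1" = "reachable" ]; then echo "promote"; else echo "failover"; fi
}

failover_command() {
  # $1 = source state, $2 = mirror topic name. Only prints; nothing is executed.
  echo "confluent kafka mirror $(mirror_action "$1") $2 --cluster $DEST_CLUSTER --link $LINK_NAME"
}

failover_command reachable settlement.completed
failover_command unreachable settlement.completed
```

Printing rather than executing keeps a human approval step between the decision and the irreversible conversion of the mirror topic.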
Common pitfalls
- Trying Cluster Linking on a Basic/Standard cluster. It is Dedicated-only; the link create will be rejected. Provision Dedicated CKUs first (Step 2).
- Forgetting consumer.offset.sync.enable. Without it, failed-over consumer groups restart from the wrong position. Set it at link creation; it cannot be retrofitted as cleanly later.
- Setting huge retention.ms but leaving the default local hotset. You then keep years of data on broker disk and your CKU cost explodes. The whole point of confluent.tier.local.hotset.ms is to cap disk; set it small.
- No schemas in the destination region. Consumers in eu-west-1 fail to deserialize because the schema IDs are unknown. Use Schema Linking or a shared Stream Governance package so schemas exist on both sides.
- Writing to a mirror topic. Mirror topics are read-only until promoted; an app that tries to produce to one gets an error. Producers stay on the source until an explicit failover.
- Auto-creating mirror topics blindly. Leaving auto.create.mirror.topics.enable=true can pull topics you never intended to replicate (and their cost). Keep it false and create mirrors explicitly.
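Two of these pitfalls are plain config mistakes, so a pipeline can catch them before the link is created. A minimal preflight sketch, assuming the properties file from Step 5; check_link_config is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: fail early if the link config misses the offset-sync or auto-create settings.
check_link_config() {
  # $1 = path to the properties file; returns non-zero on the first problem found.
  grep -q '^consumer\.offset\.sync\.enable=true$' "$1" || {
    echo "consumer.offset.sync.enable must be true"; return 1; }
  grep -q '^auto\.create\.mirror\.topics\.enable=false$' "$1" || {
    echo "auto.create.mirror.topics.enable must be false"; return 1; }
  echo "link config OK"
}

# Usage in CI (commented out so the sketch has no side effects):
# check_link_config link-config.properties || exit 1
```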
Security notes
Every credential in this build is least-privilege and short-lived. The bootstrap Cloud key exists only to provision and is held in Vault; runtime producers, consumers, and the link each get a cluster-scoped API key with ACLs limited to the exact topics and operations they need, also issued from Vault so rotation is a key-roll, not a redeploy. Humans never use a shared login — Okta/Entra SSO with MFA fronts the Console and CLI, and Confluent RBAC bindings track IdP groups so de-provisioning is immediate. All traffic is SASL_SSL/TLS in transit, and Tiered Storage data at rest is encrypted in the object store; Wiz continuously checks those buckets and the cloud account for drift such as public exposure or weakened encryption. A link dropping out of ACTIVE, or an unexpected ACL change, raises a ServiceNow incident so security sees a ticket, not just a metric.
Cost notes
Two levers dominate. First, CKU sizing drives the base cost of each Dedicated cluster — size to sustained throughput, not peak, and scale CKUs through Terraform when load grows rather than over-provisioning up front. Second, Tiered Storage is the cost win that makes seven-year retention viable: object storage is dramatically cheaper per GB than broker disk, so a small confluent.tier.local.hotset.ms keeps the expensive hot tier tiny while history accumulates cheaply in S3/GCS/Blob. Cluster Linking itself bills on bytes mirrored across regions, so replicate only the topics that genuinely need a second region — not the whole cluster — and let cross-region egress be a deliberate choice. Pipe the Confluent billing and Metrics API into your Datadog/Dynatrace cost dashboard so the platform team sees mirrored-GB and tiered-GB trends and can challenge any topic whose cross-region replication is not earning its keep.
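The size of the gap between the hot tier and total history is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes a steady, entirely hypothetical ingest rate of 5 MB/s and ignores replication and compression; it contains no prices, only volumes.

```shell
#!/bin/sh
# Sketch: estimate data held on broker disk versus in total for a steady ingest rate.
tier_split_gb() {
  # $1 = ingest in MB/s, $2 = hotset in hours, $3 = retention in days
  hot_mb=$(( $1 * 3600 * $2 ))      # data written during one hotset window
  total_mb=$(( $1 * 86400 * $3 ))   # data written over the full retention period
  echo "hot_gb=$(( hot_mb / 1000 )) total_gb=$(( total_mb / 1000 ))"
}

tier_split_gb 5 6 2555   # prints: hot_gb=108 total_gb=1103760
```

Under those assumptions roughly 0.1 TB stays on broker disk while about 1,100 TB accumulates in object storage, which is why the hotset setting matters far more to cost than the retention period itself.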