Data Multi-cloud

Real-Time CDC with Debezium and Kafka

Nightly batch ETL is the load-bearing lie of most data platforms. Every analytics deck quietly assumes the warehouse reflects the business, but a SELECT against last night’s snapshot is a photograph of yesterday — and the gap between “yesterday” and “now” is where fraud clears, inventory goes negative, and a regulator’s reconciliation breaks. Change data capture (CDC) closes that gap by streaming every insert, update, and delete out of the operational database the moment it commits, in commit order, with no batch window and no polling load on the source. This article is a reference architecture for doing it properly with Debezium and Kafka: not a connector demo, but a durable, schema-aware, exactly-once pipeline that feeds a warehouse or lakehouse and holds up from a handful of tables to a few thousand across dozens of source databases.

The business scenario

The forcing function here is almost always a regulated business that can no longer tolerate stale data. Consider a mid-size payments processor: it settles card transactions for a few thousand merchants, and its core ledger lives in a PostgreSQL fleet. The fraud team needs to score transactions against aggregate merchant behaviour within seconds, not after a 2 a.m. batch. The finance team must produce a continuously-reconciled position because the central bank’s intraday liquidity rules now expect near-real-time reporting, not an end-of-day file. And the data science team wants every historical state of every row — not just the current value — to train models and to answer “what did this account look like at 14:32 when the chargeback posted?”

The naive fixes all fail in a recognisable way. Batch ETL with a nightly dump is hours stale and hammers the production database with a full-table scan during the maintenance window. Polling a last_updated column misses hard deletes entirely (a deleted row simply stops appearing — you never learn it left), misses any update that doesn’t touch the watermark column, and still pounds the source with WHERE updated_at > ? queries. Dual-writing from the application to both the database and the warehouse is the worst of all: the two writes are not in one transaction, so the day the second write fails the systems silently diverge, and nobody notices until an auditor does.

Log-based CDC threads the needle. The database already writes every committed change to its transaction log — the PostgreSQL write-ahead log (WAL), MySQL’s binlog, the SQL Server transaction log, Oracle redo. That log is the database’s own source of truth, ordered exactly as transactions committed. Debezium reads that, not the tables, so it captures inserts, updates, and deletes in commit order, imposes near-zero load on the source (it is a log reader, not a query workload), and never misses a change. The processor gets a faithful, ordered stream of every mutation to its ledger, fanned out to fraud scoring, to the finance reconciliation engine, and to the warehouse — from one capture, with one consistency story.

The scenario scales cleanly. A small shop captures twenty tables from one database into one warehouse. The large enterprise captures thousands of tables across PostgreSQL, MySQL, and SQL Server estates, fans the same streams to a dozen consumers (search indexes, caches, ML feature stores, the lakehouse), and treats Kafka as the central nervous system of the company. The architecture is the same — what changes is the number of connectors, the partition counts, and how hard you lean on the registry and exactly-once machinery.

Architecture overview

Real-Time CDC with Debezium and Kafka — architecture

The pipeline has four stages that you should keep mentally distinct because they fail and scale independently: capture (Debezium reading the log), transport (Kafka durably buffering and ordering), govern (the schema registry mediating contracts), and sink (loading the warehouse or lake exactly once).

Capture. A Debezium source connector runs inside Kafka Connect — a distributed worker cluster, not a sidecar in your app. The Postgres connector creates a logical replication slot and a publication, reads the WAL through the pgoutput plugin, and emits one event per row change. Each event is a rich envelope: a before image, an after image, the operation (c/u/d/r for create/update/delete/snapshot-read), and a source block carrying the LSN, transaction id, table, and commit timestamp. On first start the connector takes a consistent snapshot of existing rows (emitted as r reads) so the warehouse begins with full state, then switches seamlessly to streaming the log from the snapshot’s LSN forward. Connector secrets — the database replication user’s password — are never in the connector JSON; they resolve at runtime from HashiCorp Vault through the Connect config provider, so credentials rotate without editing pipeline config.

Transport. Each captured table maps to a Kafka topic (dbserver.public.transactions). Kafka is the durable, replayable backbone: events are retained, partitioned, and ordered within a partition. The single most important transport decision is keying events by primary key so that all changes to a given row land on the same partition and are therefore strictly ordered — without that, an update can overtake the insert that created it and the sink applies them out of order. Topics are replicated (replication factor 3) across brokers so a node loss costs nothing.

Govern. Debezium serializes events with Avro (or Protobuf) and registers each table’s schema in the Confluent Schema Registry (or Apicurio). The registry assigns every schema a version and enforces a compatibility policy on the wire, so the producer and every downstream consumer share one evolving contract instead of guessing at JSON shapes. This is what makes schema evolution survivable rather than a 3 a.m. incident.

Sink. A sink — Kafka Connect’s JDBC sink, the Confluent/Databricks connectors, or a stream processor like Apache Flink or Kafka Streams — consumes the topics and applies changes to the target: Snowflake, BigQuery, Databricks Delta Lake, or Amazon Redshift. Done right, this is exactly-once: each change lands in the warehouse precisely once even across retries, broker failovers, and connector restarts, so the target is a faithful mirror of the source rather than a pile of duplicates.

Wrapped around all four stages: identity from Microsoft Entra ID or Okta (SSO and RBAC into Connect, the registry UIs, and the warehouse), data-posture scanning from Wiz (is a PII column flowing into a topic nobody locked down?), runtime protection from CrowdStrike Falcon on the Connect and broker hosts, and end-to-end telemetry in Datadog or Dynatrace watching the one metric that matters most — replication lag.

Component breakdown

Stage Tooling Role in the pipeline Key configuration choices
Capture Debezium on Kafka Connect Read the DB transaction log, emit per-row change events Logical slot + publication (Postgres); snapshot.mode; key by PK; secrets via Vault provider
Transport Apache Kafka (MSK / Confluent Cloud / Event Hubs) Durable, ordered, replayable event backbone RF=3, min.insync.replicas=2, partition by key, tiered storage for long retention
Govern Confluent Schema Registry / Apicurio Version + enforce the data contract across producers and consumers BACKWARD compatibility; Avro; subject-per-topic
Process (optional) Flink / Kafka Streams / ksqlDB Joins, enrichment, deduplication, outbox routing Exactly-once checkpointing; keyed state for upserts
Sink JDBC / Snowflake / BigQuery / Databricks connectors Apply changes to the warehouse or lakehouse exactly once upsert mode, idempotent writes, exactly_once_support=required
Identity Entra ID / Okta SSO + RBAC into Connect, registry, warehouse OIDC; least-privilege replication role on the source
Secrets HashiCorp Vault Source/sink credentials, rotated, never in config Connect config.providers → Vault; dynamic DB creds
Data posture Wiz Find PII/regulated data leaking into unsecured topics DSPM scan of topics + warehouse; lineage to owner
Runtime security CrowdStrike Falcon Detect compromise on broker/Connect/sink hosts Agent on every node; block on the Connect plane
Observability Datadog / Dynatrace Lag, throughput, connector health, DLQ depth Alert on replication lag p95 + connector RUNNING state
ITSM / approvals ServiceNow Change approval for new connectors and schema changes Pipeline gate; ties schema change to a CAB ticket
CI / IaC GitHub Actions / Jenkins + Terraform Version, review, and deploy connectors and topics GitOps connector configs; terraform for MSK/topics/ACLs

A few choices deserve the why, because they are the ones teams get wrong.

Why log-based capture, not query-based. Reading a last_updated column feels simpler, but it is structurally broken: it cannot see deletes (the row is just gone), it misses updates that don’t bump the watermark, and it competes with production traffic. The transaction log already contains every committed change in exact order — Debezium reading the log is lower-impact on the source and strictly more complete. The one operational cost is that you must keep the log around long enough for the connector to read it: an unmonitored, lagging Postgres slot pins the WAL and can fill the source disk. That is a monitoring problem, not a reason to poll.

Why key by primary key. Kafka only guarantees order within a partition. If events for one row spread across partitions, the sink can apply an update before the insert that created the row, or a delete before a late update, and the warehouse ends up wrong. Keying every change event by the row’s primary key forces all of that row’s history onto one partition, in commit order — the foundation everything else stands on.

Why a schema registry is non-negotiable. Without it, every consumer reverse-engineers an implicit contract from whatever JSON happens to arrive, and the first time someone drops a column upstream, half your sinks break silently and the other half ingest garbage. The registry makes the schema an explicit, versioned, enforced artifact: a producer literally cannot publish a change that violates the configured compatibility policy, so an incompatible source change is caught at the source — at deploy time, in CI — instead of corrupting the warehouse at 3 a.m.

Schema evolution: the part that bites

Source schemas change — a column is added, widened, renamed, dropped — and in a streaming pipeline those changes ride the same stream as the data. Handle this with a compatibility contract in the registry, chosen deliberately:

Policy What it allows Reader/writer rule When to use
BACKWARD (default) Add optional fields, drop fields New schema reads old data Consumers upgrade first — the common CDC case
FORWARD Add fields, drop optional fields Old schema reads new data Producers upgrade first; old consumers must keep working
FULL Intersection of both Both directions safe Strict, slow-moving contracts
NONE Anything No checks Never, in an enterprise

The pragmatic default for CDC is BACKWARD: new columns arrive as nullable fields with defaults, so a downstream consumer running the new schema can still read events written under the old one, and old consumers ignore fields they don’t know. The hard cases are the ones the registry deliberately rejects — a non-nullable column with no default, an incompatible type change, a rename treated as drop-plus-add. Those must be a deliberate, reviewed migration, not an accident: the schema change rides through GitHub Actions where a registry-compatibility check fails the build if it would break consumers, and a ServiceNow change ticket records the approval before the connector config and the matching downstream DDL ship together.

A telling failure mode: someone runs ALTER TABLE ... DROP COLUMN on the source with no coordination. With BACKWARD compatibility the registry accepts it (dropping a field is backward-compatible), the column simply stops appearing in events, and the warehouse keeps the historical values it already has — graceful. The same person renaming the table’s primary key, however, breaks event keying and must be a planned cutover. Knowing which class a change falls into is the senior skill here.

Exactly-once to the warehouse

“Exactly-once” is the phrase everyone wants and few configure correctly. There are two halves.

Producer/transport side. Debezium gives at-least-once delivery by default: after a crash it re-reads from the last committed log offset, which can re-emit events already published. Kafka’s idempotent producer and transactional writes (enable.idempotence=true, transactional EXACTLY_ONCE mode in Connect) deduplicate those within Kafka, so the topic contains each change once even across connector restarts.

Sink side — where it actually matters. Even with at-least-once into the warehouse, you achieve effective exactly-once by making the apply idempotent: configure the sink in upsert mode keyed on the primary key, so re-delivering the same change is a no-op (the MERGE overwrites with identical values) and a delete is applied by primary key regardless of how many times it arrives. For warehouses that support it, the connector commits Kafka offsets and the warehouse write in one two-phase unit (Kafka Connect exactly.once.source.support, or the Snowflake/Databricks connectors’ transactional commit) so a failure between “wrote to warehouse” and “committed offset” cannot duplicate or lose a row.

A minimal idempotent JDBC sink communicates the intent:

{
  "name": "warehouse-sink-transactions",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver.public.transactions",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "txn_id",
    "delete.enabled": "true",
    "exactly.once.support": "required",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "https://schema-registry:8081",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq.transactions"
  }
}

Two flags carry the weight: insert.mode=upsert with pk.fields makes every write a MERGE so retries are harmless, and delete.enabled=true honours Debezium’s tombstone events (a null-value record after a delete) so removals actually propagate instead of leaving zombie rows. The dead-letter queue (errors.deadletterqueue) catches the poison records — a malformed value, a type the warehouse rejects — so one bad event parks in the DLQ for inspection instead of halting the whole connector.

For lakehouse targets the better pattern is often a stream processor in the middle: Flink with exactly-once checkpointing consumes the CDC topics, deduplicates on key, and writes to Delta Lake or Apache Iceberg with transactional commits, materialising both a current-state table (latest row per key) and an append-only history table (every version, for the data scientists who wanted to ask “what did this look like at 14:32”). The append-only log is the time-travel answer.

Enterprise considerations

Security and access. The replication user on the source needs REPLICATION plus SELECT on captured tables and nothing more — never a superuser. Its password is a dynamic, short-lived credential issued by Vault and injected into the connector through Connect’s config provider, so it rotates without touching pipeline config and never lands in a JSON file or git. Kafka enforces mTLS between brokers and clients and ACLs scoped per connector (this Debezium principal may write only its own topic prefix; this sink may read only the topics it owns). Human access to Connect, the registry, and the warehouse rides SSO through Entra ID or Okta with RBAC. The under-appreciated risk is governance: change events carry full before/after images, so a captured table with a ssn or pan column streams that PII into a topic — Wiz DSPM scans topics and the warehouse to flag regulated data flowing into an unsecured stream and route it to an owner, and a single-message transform (SMT) masks or drops sensitive fields at the connector before they ever reach a topic. CrowdStrike Falcon runs on the broker, Connect, and sink hosts for runtime detection — the streaming control plane is high-value and an attacker who owns a Connect worker can read every change in the business.

Failure modes — know them cold.

Scaling. Three independent dials. Source side: one connector per database (or a few, partitioned by table set) — a single connector reads one log serially, so you scale across databases with more connectors, not by parallelising one log. Kafka side: partition count per topic sets consumer parallelism — but because order is per-partition and you keyed by PK, raise partitions thoughtfully (re-partitioning reshuffles keys and can briefly disorder a key’s history during the transition). Sink side: the sink connector’s task count fans out across partitions for throughput. The usual first bottleneck is the sink, not capture — warehouses prefer fewer larger writes, so batch sink commits (more rows per MERGE) rather than committing every event, trading a little latency for a lot of throughput.

Cost. The line items are Kafka (brokers/storage or Confluent Cloud / Amazon MSK / Azure Event Hubs throughput units), Connect compute, and warehouse ingest. Levers: (1) don’t capture columns you don’t need — column filtering and SMT drops shrink every downstream byte; (2) tiered storage (MSK tiered storage, Confluent infinite retention) keeps long replay history on cheap object storage instead of expensive broker disk; (3) batch sink writes — per-row MERGEs into Snowflake or BigQuery are the classic way to set money on fire, so accumulate and commit in batches; (4) right-size partitions — every partition has fixed overhead, and over-partitioning a low-volume table wastes broker resources for no parallelism gain; (5) compact topics that only need latest-state (cleanup.policy=compact) so the topic self-prunes to one record per key.

Reliability and DR (RTO/RPO). Kafka’s replication.factor=3 with min.insync.replicas=2 survives a broker loss with zero data loss. For region failure, MirrorMaker 2 (or Confluent Cluster Linking) replicates topics and consumer offsets to a second region, and the consequential detail is offset translation so a sink failing over resumes near where it stopped rather than reprocessing everything. Because the warehouse apply is idempotent upsert, some reprocessing on failover is harmless — exactly the property that makes CDC DR forgiving. A pragmatic target: RPO near zero (replicated, ordered log) and RTO in low minutes (restart connectors against the mirrored cluster, sinks resume by key). The deeper guarantee is that the source database is always the source of truth — if the whole pipeline is lost, you re-snapshot and rebuild the warehouse from scratch, slowly but correctly.

Observability. The signal that matters above all is end-to-end replication lag — the wall-clock delay between a commit on the source and that change landing in the warehouse. Instrument it in Datadog or Dynatrace from the source.ts_ms in the Debezium envelope to the sink commit time, and alert on the p95. Track connector state (RUNNING vs FAILED), per-topic consumer lag, DLQ depth (a rising DLQ is a schema or data-quality fire), and the Postgres slot lag that protects the source disk. Wire connector failures into ServiceNow so an incident opens automatically. The classic trap is alerting only on liveness — a healthy-looking connector silently hours behind is the outage nobody sees until the fraud team’s numbers are stale.

Reference enterprise example

Tollgate Payments, a fictional ₹-scale payments processor (~2,800 merchants, ~6 million transactions/day), built this pipeline to move fraud scoring and intraday reconciliation off nightly batch. Source: a 6-node PostgreSQL fleet behind their ledger; targets: Databricks Delta Lake for analytics and ML, plus a Kafka fan-out to the real-time fraud engine.

Decisions they made. They ran Debezium Postgres connectors on a 4-worker Kafka Connect cluster, one connector per shard, with incremental snapshotting so onboarding the 900-million-row transactions table never locked production. Events were Avro, keyed by txn_id, registered in Confluent Schema Registry under BACKWARD compatibility; a GitHub Actions check failed any PR whose schema change would break a consumer. Kafka was Amazon MSK (RF=3, ISR=2, tiered storage for 30-day replay). A Flink job consumed the CDC topics with exactly-once checkpointing, deduplicated on key, and wrote two Delta tables — current-state and full-history — while a second consumer fed fraud scoring directly. The replication user’s credentials came from Vault (rotated every 24h); an SMT masked the PAN field to its last four digits before it ever hit a topic, and Wiz DSPM continuously verified no raw card data leaked into an unsecured stream. CrowdStrike Falcon guarded the Connect and broker hosts; Datadog alerted on lag p95 and slot lag; ServiceNow gated every new connector behind CAB approval.

The numbers. End-to-end lag p95 held at ~3 seconds source-commit to Delta, against a former 6–18 hour batch gap. Peak throughput ~4,000 change-events/sec across the fleet. Monthly run cost landed near ₹11.6 lakh (~$13,900): MSK brokers + tiered storage ~$4,800, the Connect/Flink compute ~$4,200, Schema Registry + Databricks ingest ~$3,000, with Vault/Wiz/Datadog allocation the remainder. Tiered storage cut broker disk cost roughly in half versus retaining 30 days on local volumes, and dropping unused columns at the connector shaved meaningful Databricks ingest.

The outcome. Fraud scoring moved from next-day to sub-5-second, catching a class of rapid card-testing attacks the batch model never saw in time. Finance got a continuously-reconciled intraday position that satisfied the central bank’s near-real-time liquidity reporting — the line that got the CFO’s attention, because the old end-of-day file would not have passed the new rule. And the full-history Delta table let data science train on every prior state of every account, answering “what did this look like at 14:32” for the first time. In a region game-day, MirrorMaker 2 with offset translation failed the consumers to the secondary cluster; the idempotent Delta upserts reprocessed a few minutes of overlap harmlessly, and lag recovered inside the 5-minute RTO target.

When to use it

Use this architecture when you need fresh data (seconds, not hours), you must capture deletes and full change history rather than current-state snapshots, you have multiple consumers that should share one capture (warehouse and search and a cache and ML features), and you cannot afford to load the source database with polling. That covers most “our analytics are always stale” and “we need real-time X off the operational database” demand — fraud, reconciliation, real-time dashboards, cache and search invalidation, ML feature pipelines, and database replication or migration.

Trade-offs to accept. CDC adds real operating surface: a Connect cluster to run, replication slots and log retention to babysit, a registry to govern, and exactly-once semantics that must be configured, not wished for. The source DBA must enable logical replication and grant a replication role — a conversation, not a config flag. And the eventual-consistency reality is that the warehouse trails the source by your replication lag; downstream logic must tolerate a few seconds of skew and the rare reprocessed-on-failover duplicate (which idempotent upserts absorb).

Anti-patterns. (1) Query-based polling on a last_updated column — misses deletes and silent updates, loads the source. (2) Not keying by primary key — out-of-order apply corrupts the target. (3) No schema registry — one upstream column change breaks consumers silently. (4) At-least-once into a warehouse with append, not upsert — duplicate rows on every retry. (5) Unmonitored replication slot — pins the WAL and fills the source disk, turning a pipeline hiccup into a production outage. (6) Secrets in connector JSON — the replication password ends up in git; resolve from Vault instead.

Alternatives, and when they win. If you need only periodic, current-state copies and seconds-fresh is overkill, batch ELT (Fivetran, Airbyte, dbt on a schedule) is far simpler — use it when latency genuinely doesn’t matter. If your cloud database offers managed CDC (AWS DMS, GCP Datastream, Azure SQL change feed) and you don’t need Kafka’s fan-out or replay, the managed service skips running Connect and brokers — choose it when one source feeds one target and operational simplicity beats flexibility. If the application can adopt the transactional outbox pattern (write business data and an event row in one DB transaction, and let Debezium capture the outbox table), you get clean domain events instead of raw row changes — the better long-term shape for event-driven microservices, and it composes with everything here. The Debezium-and-Kafka architecture is the destination when freshness, multiple consumers, full change history, and cross-database scale all matter at once — not always the starting line.

CDCDebeziumKafkaStreamingData EngineeringArchitecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading