Real-Time Analytics with Databricks and Confluent Kafka on AWS

A fast-fashion retailer — call it the kind of brand that drops a new collection every Thursday and watches it sell out by Saturday — gets an ultimatum from its Chief Digital Officer the morning after a botched flash sale. The homepage merchandising had promoted a sold-out SKU for forty minutes because the inventory and recommendation systems were running on a batch pipeline that refreshed every six hours. Forty minutes of paid traffic landing on an out-of-stock hero product, during the highest-converting window of the week. The ask afterward is blunt: “I want the site to react to what shoppers are doing right now — re-rank products, suppress what’s out of stock, and trigger the cart-abandonment journey within seconds, not after the next batch run.” The constraint is the one every retailer lives with: Black Friday and collection-drop spikes are 20-40x normal traffic, the clickstream is PII-adjacent (it links to logged-in customer IDs and is governed under GDPR/CCPA), and the platform has to be cheap enough at idle on a Tuesday afternoon that finance does not kill it. A nightly Spark job will not do this. This article is the reference architecture for building that platform properly on AWS — a streaming clickstream analytics system on Confluent Cloud Kafka and Databricks Structured Streaming, landing in a governed Delta Lake, that a retailer’s data leadership and CISO will actually sign.

The pressures stack the way they always do in retail. Latency means the recommendation service needs fresh signal in single-digit seconds, not hours. Scale means a Thursday drop can 30x the event rate inside a minute, and the platform must absorb that without dropping events or paging anyone. Governance means clickstream tied to customer identity is regulated personal data with retention limits and a right-to-erasure obligation. And cost means the compute has to scale to near-zero on quiet days, because the brand does not run flash sales 24/7. Streaming — specifically Kafka as the durable event backbone feeding Structured Streaming jobs that write to Delta Lake — is the pattern that satisfies all four at once: it decouples the producers (the website, the mobile app, the POS) from the consumers (analytics, ML features, dashboards) through a replayable log, and processes events incrementally as they arrive instead of re-reading the world every night.

Why not the obvious shortcuts

The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.

Shrinking the batch window — running the nightly Spark job every fifteen minutes instead — sounds like the cheap answer and is the most common trap. It multiplies cluster spin-up cost, still leaves a fifteen-minute blind spot during a drop, and does nothing for the cart-abandonment journey that needs sub-minute reaction. Querying the production database directly for “what’s selling right now” puts analytical scan load on the same OLTP store that is serving checkout during a spike — the surest way to turn a busy Thursday into an outage. A bespoke Lambda-and-DynamoDB micro-pipeline can ingest events fast, but it gives you no replay, no schema enforcement, no historical analytics, and a pile of glue code that no one wants to own at 2 a.m. on Black Friday.

Kafka plus Structured Streaming threads the needle. Kafka is a durable, replayable, partitioned log: if a consumer falls over, it resumes from its last committed offset, re-reading at most the events since that commit rather than losing them; if you ship a bug in the enrichment logic, you fix it and replay from up to a week ago, within the topic’s retention window. Structured Streaming gives you exactly-once writes into Delta Lake with the full SQL and DataFrame API, so the same engine and the same code serve real-time and historical analytics. And Delta Lake’s ACID table format means the “what’s selling right now” dashboard and the data scientist’s training set read the same governed tables, not two divergent copies.

Architecture overview

Figure: Real-Time Analytics with Databricks and Confluent Kafka on AWS (architecture diagram)

The platform runs two cooperating planes that share infrastructure but live on different rhythms: a hot streaming path that ingests, enriches, and serves clickstream in seconds, and a batch/serving plane of Delta tables that the BI and ML consumers read. Keeping them separate in your head is the first step to operating this well.

The defining property of the topology is the one the security team cares about most: Confluent Cloud and the Databricks compute plane are both reached from inside the retailer’s AWS network boundary over private connectivity, with no event or credential traversing the public internet. Confluent Cloud is reached through a PrivateLink endpoint; the Databricks classic compute plane runs in the customer’s VPC; S3 access goes through VPC gateway endpoints. That privacy posture is what makes a GDPR story for customer-linked clickstream defensible.

Hot path, following the event flow:

  1. The website, mobile app, and POS emit clickstream events — product_view, add_to_cart, search, checkout_step — to Confluent Cloud Kafka topics. Producers authenticate with short-lived API keys issued from HashiCorp Vault’s Confluent secrets engine, so no long-lived Kafka credential is baked into an app config. Every event is serialized against a schema registered in Confluent Schema Registry, which rejects a producer that ships a breaking change before it can poison downstream consumers.
  2. Events land on partitioned, replicated topics. Partitioning by customer_id (with a null-safe hash for anonymous sessions) keeps a single shopper’s events ordered on one partition, which the sessionization logic depends on. Confluent’s Tiered Storage offloads older segments to object storage so the platform can retain a long replay window cheaply.
  3. A Databricks Structured Streaming job, running on a Jobs cluster in the customer VPC, reads the topics with the native Kafka source and tracks progress in a Structured Streaming checkpoint on S3, which is what the exactly-once write depends on. It deserializes against Schema Registry, enriches each event by joining slowly-changing product and customer dimensions (broadcast Delta tables), and computes streaming aggregates: per-SKU view/cart velocity over a sliding window, and per-session state for abandonment detection using arbitrary stateful processing (flatMapGroupsWithState in Scala, applyInPandasWithState in PySpark).
  4. The job writes to Delta Lake in a medallion layout: raw events to Bronze (append-only, the immutable system of record), cleaned and enriched events to Silver, and the business aggregates to Gold. Each write is an ACID transaction; MERGE upserts keep Gold idempotent under replay.
  5. Two consumers read the Gold tables. The recommendation/merchandising service reads low-latency aggregates (served from Delta through Databricks SQL or mirrored to an online store) to re-rank and suppress out-of-stock SKUs on the site. A separate Structured Streaming sink emits a cart_abandoned event back to a Kafka topic, which a marketing consumer turns into the abandonment journey, closing the loop the CDO asked for.
  6. Unity Catalog governs every table, view, and ML feature across the workspace: one place for table ACLs, column masking on PII, and end-to-end lineage from a Kafka topic to a dashboard tile.
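The per-session abandonment state from steps 3 and 5 can be sketched in plain Python to show the shape of the logic before it is ported to a stateful streaming operator. This is a minimal sketch under stated assumptions, not the production job: the 30-minute inactivity timeout, the `checkout_complete` event type, and the class and field names are all hypothetical, since the article does not specify them.

```python
from dataclasses import dataclass, field

ABANDON_AFTER_S = 30 * 60  # hypothetical inactivity timeout (not given in the article)


@dataclass
class SessionState:
    cart: set = field(default_factory=set)
    last_seen: float = 0.0
    checked_out: bool = False


class AbandonmentDetector:
    """Keeps one small state object per session, like a keyed state store would."""

    def __init__(self, timeout_s=ABANDON_AFTER_S):
        self.timeout_s = timeout_s
        self.sessions = {}

    def on_event(self, session_id, event_type, ts, sku=None):
        state = self.sessions.setdefault(session_id, SessionState())
        state.last_seen = max(state.last_seen, ts)
        if event_type == "add_to_cart" and sku is not None:
            state.cart.add(sku)
        elif event_type == "checkout_complete":  # hypothetical terminal event
            state.checked_out = True

    def expire(self, now):
        """Emit cart_abandoned for idle sessions with an unpurchased cart, then drop state."""
        emitted = []
        for sid in sorted(self.sessions):  # sorted() copies the keys, so deleting is safe
            state = self.sessions[sid]
            if now - state.last_seen >= self.timeout_s:
                if state.cart and not state.checked_out:
                    emitted.append({"type": "cart_abandoned",
                                    "session_id": sid,
                                    "skus": sorted(state.cart)})
                del self.sessions[sid]
        return emitted
```

In a streaming job the `expire` step corresponds to a state timeout firing for a key, and the emitted records are what the separate sink would publish back to Kafka.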

Batch/serving plane, on its own cadence: BI tools query Databricks SQL warehouses over the Gold tables for the live merchandising dashboard; data scientists build training sets and features in Databricks Feature Store over Silver/Gold, governed by the same Unity Catalog grants; scheduled jobs run OPTIMIZE and VACUUM to compact small streaming files and reclaim storage. Because everything is Delta, the real-time dashboard and the offline model read the same tables — no Lambda-architecture duplication, no reconciliation pain.

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot mitigation in front of the web tier that emits events | Bot filtering so scraper traffic does not pollute the clickstream; origin shield |
| Event backbone | Confluent Cloud Kafka | Durable, replayable, partitioned event log | PrivateLink ingress; partition by customer_id; Tiered Storage for long retention |
| Schema governance | Confluent Schema Registry | Enforce event contracts, block breaking changes | BACKWARD compatibility; Avro/Protobuf; CI check on schema PRs |
| Stream processing | Databricks Structured Streaming | Ingest, enrich, aggregate, sessionize, write to Delta | Kafka source; S3 checkpoints; availableNow vs continuous trigger per job |
| Lakehouse storage | Delta Lake on S3 | ACID medallion tables (Bronze/Silver/Gold) | MERGE for idempotency; partitioned by event date; deletion vectors |
| Governance | Unity Catalog | Table ACLs, column masking, lineage, audit | Row filters + column masks on PII; lineage graph; system tables for audit |
| Secrets | HashiCorp Vault | Short-lived Kafka API keys, S3 creds, third-party tokens | Confluent secrets engine; dynamic leases; AWS auth method for jobs |
| Identity / SSO | Okta + AWS IAM Identity Center | Workforce SSO into Databricks, Confluent, AWS | SAML/SCIM provisioning; group claims map to Unity Catalog grants |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, S3/Confluent exposure, IaC scanning | Agentless scan of S3 + VPC; Wiz Code gates Terraform PRs on misconfig |
| Runtime security | CrowdStrike Falcon | Runtime protection on Databricks driver/worker EC2 and producer hosts | Sensor in the worker AMI; detections piped to the SOC |
| Observability | Datadog | Stream-lag SLOs, throughput, cluster + cost telemetry | Kafka consumer-lag monitors; Spark Structured Streaming integration; SLO dashboards |
| ITSM / approvals | ServiceNow | Change approvals for schema/topic changes; incident records | Change gate before a new topic or schema goes live; auto-ticket on lag-SLO breach |
| CI/CD + IaC | GitHub Actions + Argo CD + Terraform | Pipeline build/test; GitOps deploy of jobs; infra as code | OIDC to AWS (no stored creds); Terraform for Confluent + Databricks; Argo CD syncs DLT/job specs |
| Config management | Ansible | Hardened, sensor-baked worker AMIs and producer host baselines | Golden AMI build; CrowdStrike + agent baked in; CIS hardening |

A few of these choices deserve the why, because they are the ones teams get wrong.

Why partition by customer_id, and why it is a tradeoff. Sessionization and abandonment detection need all of one shopper’s events in order on one partition, so customer_id is the natural key. The cost is hot partitions: a handful of power-users or a bot can skew load onto one partition during a spike. Mitigate with a salted key for known-hot identities and bot filtering at Akamai so scraper traffic never reaches Kafka in the first place. The alternative — partitioning round-robin for perfect balance — breaks per-session ordering and is the wrong trade for this workload.
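A small sketch of the keying rule described above may help make the tradeoff concrete. It is an illustration under assumptions, not the producer's actual code: the hot-key list, the salt fan-out of 4, and the function names are hypothetical, and the MD5-based hash only stands in for Kafka's own partitioner (which uses a different hash). The hot-identity salt is derived from the session ID so that one session still stays ordered on one partition.

```python
import hashlib

NUM_PARTITIONS = 48          # matches the topic sizing used in this article
SALT_BUCKETS = 4             # hypothetical fan-out for known-hot identities
HOT_KEYS = {"cust-bot-001"}  # hypothetical list of known-hot customer IDs


def _stable_hash(value: str) -> int:
    """Deterministic hash; Python's built-in hash() is randomized per process."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def partition_key(customer_id, session_id):
    """Return the message key for one clickstream event."""
    if customer_id is None:
        # Null-safe: anonymous traffic is keyed by session so it stays ordered.
        return f"anon:{session_id}"
    if customer_id in HOT_KEYS:
        # Spread a hot identity across a few sub-keys, one per session bucket.
        return f"{customer_id}#{_stable_hash(session_id) % SALT_BUCKETS}"
    return customer_id


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition (stand-in for the client-side partitioner)."""
    return _stable_hash(key) % num_partitions
```

The cost of salting is that a hot customer's events are ordered per session rather than per customer, which is acceptable for sessionization but would not be for logic that needs a customer-wide ordering.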

Why Structured Streaming, not Kafka Streams or Flink, here. The team already runs Databricks for batch analytics and ML. Structured Streaming lets them use one engine, one language (PySpark/SQL), and one governance plane (Unity Catalog) for real-time and historical work, and write straight into the same Delta tables the data scientists already use. A standalone Flink cluster would be lower-latency at the extreme, but it adds a second system to operate, a second security boundary, and a sync problem between its output and the lakehouse. For seconds-not-milliseconds retail analytics, the unified lakehouse wins.

Why the medallion layout, not a single table. Bronze is the immutable, replayable record — if enrichment logic has a bug, you reprocess Bronze, not the source. Silver is the cleaned, conformed, PII-governed layer. Gold is the business-ready aggregate the dashboard reads. Collapsing them means a bad enrichment overwrites your only copy of the truth, and you cannot replay. The separation costs storage and a little latency per hop, and it is non-negotiable for a regulated, replayable pipeline.

Implementation guidance

Provision with Terraform, and treat the network and identity as the first deliverables. The deployment order matters: get PrivateLink and the VPC endpoints wrong and the streaming job fails to reach Kafka or S3 and hangs silently.

  1. The customer VPC with private subnets for the Databricks classic compute plane, gateway VPC endpoints for S3, and an interface endpoint / PrivateLink to Confluent Cloud.
  2. Confluent Cloud environment, cluster, topics, and Schema Registry — provisioned with the Confluent Terraform provider so topics, partition counts, and ACLs are code, not console clicks.
  3. Databricks workspace with Unity Catalog enabled, the metastore attached, and the storage credential / external location pointing at the S3 data buckets.
  4. Vault with the Confluent and AWS secrets engines configured, and the Databricks jobs granted the AWS auth role.
  5. Jobs and Delta Live Tables pipelines deployed via Argo CD from a Git repo (Databricks Asset Bundles), so the streaming jobs follow the same GitOps flow as everything else.

A minimal Terraform shape for the Confluent topic communicates the intent, with explicit partitions, retention, and durability settings (schema compatibility is set separately in Schema Registry):

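resource "confluent_kafka_topic" "clickstream" {
  kafka_cluster {
    id = confluent_kafka_cluster.prod.id   # cluster resource assumed to be defined elsewhere
  }
  topic_name       = "clickstream.events.v1"
  partitions_count = 48                     # sized for drop-day peak, not idle
  rest_endpoint    = confluent_kafka_cluster.prod.rest_endpoint

  config = {
    "cleanup.policy"      = "delete"
    "retention.ms"        = "604800000"     # 7-day replay window
    "min.insync.replicas" = "2"             # durability under broker loss
  }

  # Kafka API credentials are assumed to come from the provider configuration;
  # otherwise add a credentials { key, secret } block here.
}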

And the heart of the hot path — the Structured Streaming read and exactly-once Delta write — fits in a few lines and shows where exactly-once actually comes from:

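# Secret scope and key names below are illustrative; adjust to your workspace.
bootstrap = dbutils.secrets.get("vault", "confluent_bootstrap")
api_key = dbutils.secrets.get("vault", "confluent_api_key")
api_secret = dbutils.secrets.get("vault", "confluent_api_secret")

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap)
    .option("kafka.security.protocol", "SASL_SSL")       # Confluent Cloud requires TLS + SASL
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
            f'required username="{api_key}" password="{api_secret}";')
    .option("subscribe", "clickstream.events.v1")
    .option("startingOffsets", "latest")                 # first run only; the checkpoint wins afterward
    .load())

# parse_and_enrich is a helper defined elsewhere in the job.
(parse_and_enrich(events)              # deserialize via Schema Registry + dim join
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/_checkpoints/bronze_events")  # exactly-once
    .outputMode("append")
    .trigger(processingTime="10 seconds")
    .toTable("retail.bronze.clickstream"))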

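The checkpoint on S3, together with the Delta transaction log, is what makes the write exactly-once. The checkpoint records the Kafka offset range planned for each micro-batch before it runs, and the Delta sink records the batch ID inside the same transaction that adds the data. After a restart the job re-runs the same batch over the same offsets, and a batch that already committed is recognized and skipped: no gaps, no duplicates.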
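To see why a restart produces neither gaps nor duplicates, here is a toy model of the mechanism in plain Python. It is a simplified sketch, not Spark's or Delta's implementation: `ToyDeltaSink`, `run_batch`, and the dictionary-based checkpoint are hypothetical stand-ins, and the `crash_after_write` flag exists only to simulate a failure between the table commit and the checkpoint update.

```python
class ToyDeltaSink:
    """Stand-in for a transactional table: rows plus the last committed batch id."""

    def __init__(self):
        self.rows = []
        self.last_committed_batch = -1

    def commit(self, batch_id, rows):
        if batch_id <= self.last_committed_batch:
            return False  # replayed batch: already committed, skip it
        # In a real table the rows and the batch id land in one atomic commit.
        self.rows.extend(rows)
        self.last_committed_batch = batch_id
        return True


def run_batch(log, checkpoint, sink, batch_size, crash_after_write=False):
    """Plan the offset range first, then write, so a retry reuses the same range."""
    batch_id = checkpoint["next_batch"]
    planned = checkpoint["planned"]
    if batch_id not in planned:
        start = checkpoint["next_offset"]
        planned[batch_id] = (start, min(start + batch_size, len(log)))
    start, end = planned[batch_id]
    sink.commit(batch_id, log[start:end])
    if crash_after_write:
        raise RuntimeError("simulated crash before the checkpoint advanced")
    checkpoint["next_offset"] = end
    checkpoint["next_batch"] = batch_id + 1
```

If the process dies after the sink commit, the next run re-plans nothing: it finds the same batch ID with the same offsets, the sink refuses the duplicate, and the checkpoint then advances normally.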

Identity: federate the humans, lease the machines. Human SSO flows Okta → AWS IAM Identity Center → Databricks and Confluent over SAML, with SCIM provisioning so a leaver loses access everywhere at once. Okta group membership maps to Unity Catalog grants, so “merchandising-analysts” get read on Gold but never see the PII columns in Silver. Machine identity is short-lived: producer apps and Databricks jobs fetch Kafka API keys and S3 credentials from HashiCorp Vault with dynamic leases via the AWS auth method, so there is no static client.secret in any config file — a hard lesson the platform team intends never to repeat.

Schema as a contract, enforced in CI. Register every event schema in Confluent Schema Registry with BACKWARD compatibility and gate schema changes in GitHub Actions: a PR that proposes a breaking change fails the build before it can break consumers. New topics and schema changes additionally pass a ServiceNow change approval so data governance has a documented gate, not a surprise in production.
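The CI gate can be illustrated with a simplified compatibility check. This is a sketch of the idea, not Schema Registry's algorithm: schemas are reduced to a hypothetical `{field: {"type", "default"}}` dictionary, and the check is deliberately stricter than Avro by treating every type change as breaking (real Avro permits some promotions). In practice the build would ask Schema Registry itself for the verdict; the function name and schema shape here are assumptions.

```python
def backward_violations(old_schema: dict, new_schema: dict) -> list:
    """Return the reasons a change is not BACKWARD compatible (empty list = pass).

    BACKWARD means a reader on the new schema can still decode data that was
    written with the old schema. Removing a field is fine; adding a field is
    fine only if it carries a default the reader can fall back on.
    """
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type "
                f"{old_schema[name]['type']} -> {spec['type']}"
            )
    return problems


if __name__ == "__main__":
    v1 = {"customer_id": {"type": "string"}, "sku": {"type": "string"}}
    v2 = {**v1, "channel": {"type": "string"}}  # new required field, no default
    for reason in backward_violations(v1, v2):
        print("BLOCKED:", reason)
```

A CI step would exit non-zero when the returned list is non-empty, which is what fails the PR before any producer ships the change.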

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction: PrivateLink everywhere, no public data-plane surface, identity-based access only, and least-privilege grants scoped per table in Unity Catalog. Layer on top: (a) Unity Catalog column masking and row filters on every PII column in Silver, so even an analyst with table access sees hashed customer IDs unless explicitly entitled; (b) Wiz running continuous CSPM and sensitive-data scanning across S3 and the VPC, alerting the moment a bucket drifts toward public exposure or an over-broad IAM policy appears, with Wiz Code scanning the Terraform in the PR so a misconfiguration is caught before apply, not after; (c) CrowdStrike Falcon sensors baked into the Databricks worker AMI (built by Ansible) and on the producer hosts, giving runtime threat detection that feeds the retailer’s SOC; (d) a lag-SLO breach or a blocked schema change auto-raises a ServiceNow incident so the on-call has a ticket, not just a Datadog alert. The right-to-erasure obligation is handled in Delta with DELETE ... WHERE customer_id = ? and deletion vectors. Because deletion vectors only mark rows as removed, a follow-up REORG TABLE ... APPLY (PURGE) and VACUUM are what physically remove the data within the retention horizon, and Unity Catalog’s lineage identifies the downstream tables the deletion has to reach.
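The masking rule in (a) amounts to a small conditional, sketched here in Python only to show the logic. In Unity Catalog this would be a SQL function attached to the column as a mask rather than application code. The group name `pii-readers`, the salt value, and the function name are hypothetical.

```python
import hashlib


def mask_customer_id(customer_id, caller_groups,
                     entitled_group="pii-readers", salt="demo-salt"):
    """Return the raw ID for entitled callers and a salted SHA-256 hash otherwise."""
    if customer_id is None:
        return None
    if entitled_group in caller_groups:
        return customer_id
    # Deterministic hash: joins and distinct counts still work on the masked value.
    return hashlib.sha256((salt + customer_id).encode("utf-8")).hexdigest()
```

Because the hash is deterministic, an analyst in “merchandising-analysts” can still count distinct shoppers or join sessions on the masked column without ever seeing the identifier itself.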

Cost optimization. Compute dominates on a streaming platform, and the retail traffic shape — quiet weekdays, violent drop-day spikes — is exactly where naive always-on clusters bleed money.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Spot for workers | Run Databricks worker nodes on EC2 Spot with on-demand driver | 50-70% off worker compute |
| Autoscaling + auto-terminate | Cluster autoscaling on streaming backlog; idle SQL warehouses suspend | Scales toward zero on quiet days |
| Trigger tuning | availableNow micro-batch for non-urgent sinks vs continuous for hot path | Avoids paying for idle continuous jobs |
| Tiered Storage | Confluent offloads old segments to object storage | Cheap long replay window without fat brokers |
| OPTIMIZE / file sizing | Compact small streaming files; right-size partitions | Cuts S3 request and scan cost on reads |

Pipe cluster and Kafka cost telemetry to Datadog and tag it by pipeline, so finance sees cost-per-stream and the team can defend the platform’s idle-day economics — the number that decided whether this got built at all.

Scalability. Each tier scales independently. Confluent scales by adding partitions (consumer parallelism) and CKU capacity (throughput); size partitions for the drop-day peak, since you can add but not easily shrink them. Databricks scales workers via cluster autoscaling keyed on streaming backlog, and you scale a single job by adding partitions upstream so more tasks can run in parallel. Delta read performance scales with OPTIMIZE compaction and Z-ordering on the columns the dashboard filters. The natural ceiling is partition count chosen too low at the start — under-provision partitions and a 30x drop-day spike cannot fan out across workers no matter how many you add, which is why partition sizing is a peak-planning decision, not a steady-state one.
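The peak-planning arithmetic can be written down in a few lines. The numbers in the usage note are purely illustrative assumptions (the article gives the spike multiplier but no baseline rate or per-partition throughput), and the helper names are hypothetical. The second function reflects the default behavior where the Kafka source runs roughly one task per topic partition.

```python
import math


def partitions_needed(baseline_eps, spike_multiplier, per_partition_eps, headroom=1.5):
    """Partitions required to absorb the peak event rate with some headroom."""
    peak_eps = baseline_eps * spike_multiplier
    return math.ceil(peak_eps * headroom / per_partition_eps)


def effective_parallelism(partitions, worker_cores):
    """Read parallelism is capped by whichever is smaller: partitions or cores."""
    return min(partitions, worker_cores)
```

With a hypothetical baseline of 1,000 events per second, a 30x drop-day spike, 1,000 events per second of sustainable throughput per partition, and 1.5x headroom, `partitions_needed(1000, 30, 1000)` comes to 45, which fits under a 48-partition topic. The ceiling shows up in the other direction: `effective_parallelism(12, 96)` is 12, so 84 of those cores would sit idle on the read side no matter how far the cluster autoscales.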

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per tier. Confluent Cloud offers a managed multi-AZ cluster (near-zero RPO within a region) and Cluster Linking to a paired region for cross-region DR. Delta on S3 inherits S3’s durability, with cross-region replication of the data buckets as the recovery guarantee. Because Kafka holds a 7-day replay window, the lakehouse is rebuildable by replaying Bronze — the strongest recovery story this design has. A pragmatic target for this platform: RTO 30 minutes, RPO near-zero for the hot aggregates (resume the streaming job in the paired region from replicated checkpoints), with the full historical lakehouse rebuildable from cross-region S3 and Kafka replay. Akamai health checks drive edge failover for the web tier that emits events.

Observability. Instrument the pipeline end to end in Datadog: Kafka consumer lag per consumer group, end-to-end event latency (event timestamp to Gold write), micro-batch duration and input rows from the Spark Structured Streaming integration, and cluster utilization and cost. Define the SLO the business actually feels — p99 click-to-insight lag under ten seconds — and alert on its error budget, not on raw CPU. Emit the metrics merchandising cares about (per-SKU velocity freshness, abandonment-event emission rate) as first-class monitors. Unity Catalog system tables provide the audit and lineage trail compliance reads. New schemas and topics pass a ServiceNow change approval before going live, giving governance a documented gate.
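As a rough illustration of alerting on error budget rather than raw CPU, the sketch below computes a nearest-rank p99 and the share of the budget still unspent over a window of click-to-insight latencies. The ten-second threshold comes from the SLO above; the 99% target, the function names, and the sample data in the usage note are assumptions, and a real deployment would compute this in the monitoring tool rather than in application code.

```python
def p99(latencies_s):
    """Nearest-rank 99th percentile, using integer math to avoid float rounding."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n)
    return ordered[rank - 1]


def error_budget_remaining(latencies_s, threshold_s=10.0, slo_target=0.99):
    """Fraction of the error budget left: 1.0 = untouched, 0.0 or below = exhausted."""
    allowed_bad = (1 - slo_target) * len(latencies_s)
    if allowed_bad == 0:
        return 0.0
    bad = sum(1 for latency in latencies_s if latency > threshold_s)
    return 1.0 - bad / allowed_bad
```

For a window of 1,000 events where 5 took longer than ten seconds, the budget allows 10 slow events, so roughly half of it remains; the alert should fire on how fast that fraction is shrinking, not on any single slow batch.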

Governance. Pin Databricks Runtime versions explicitly so streaming semantics do not drift under you; promote runtime and library upgrades through a staging workspace first. Keep job definitions, schemas, and Terraform in version control, reviewed and instantly revertable, deployed via Argo CD so production matches Git. Govern every table through Unity Catalog with explicit grants, column masks, and lineage, and let Wiz independently verify the storage posture the grants assume. Log producer and consumer access for audit, with the retention and right-to-erasure paths wired into Delta — because clickstream tied to customer identity is personal data under the same regime that started this whole project.

Explicit tradeoffs

Accept these or do not build it. Streaming adds real moving parts — a Kafka cluster to operate, checkpoints and offsets to reason about, schema evolution to manage, and small-file maintenance that batch never made you think about. Exactly-once is exactly-once for the Delta write; the moment you push to an external system (the online recommendation store, the marketing topic) you are back to at-least-once and must make those consumers idempotent. End-to-end latency is the sum of Kafka, the micro-batch trigger interval, and the write — seconds, realistically, not milliseconds, so a true sub-second personalization need wants a different tool. The PrivateLink-everywhere posture that makes the CISO sign costs you setup complexity and the loss of public debugging shortcuts, and forgetting one VPC endpoint is a silent hang, not a clear error. The Okta-to-IAM-Identity-Center federation adds a hop the single-IdP shops will not need. And the medallion layering, the schema registry, and the per-stream cost tagging are overhead you can skip for a proof-of-concept and absolutely cannot skip for a regulated, drop-day-scale platform.
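The at-least-once caveat is easiest to see from the consumer's side. The sketch below shows one common way to make an external sink idempotent, by remembering a unique event ID and ignoring repeats. It assumes every event carries an `event_id` field, which the article does not state, and the class is a hypothetical stand-in for the online store; a real store would also need to bound or expire the set of seen IDs.

```python
class IdempotentStore:
    """Stand-in for an online store that may be handed the same event twice."""

    def __init__(self):
        self.seen_event_ids = set()
        self.cart_adds_by_sku = {}

    def apply(self, event):
        """Apply an event once; return False if it was a redelivery."""
        if event["event_id"] in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event["event_id"])
        sku = event["sku"]
        self.cart_adds_by_sku[sku] = self.cart_adds_by_sku.get(sku, 0) + 1
        return True
```

Redelivering the same micro-batch after a retry then leaves the counters unchanged, which is the property the recommendation store and the marketing consumer both need.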

The alternatives, and when they win. If your latency need is genuinely sub-second per event and the logic is simple, a Flink or Kafka Streams application is lower-latency than micro-batch Structured Streaming, at the cost of a second system to run and a sync problem with the lakehouse. If you want a fully managed path with less to operate and can live with its constraints, Amazon Kinesis Data Streams + Managed Service for Apache Flink keeps everything inside AWS-native services and skips the Confluent contract, though you trade away Confluent’s Schema Registry maturity, Tiered Storage, and multi-cloud portability. If your “real time” really means “every fifteen minutes is fine,” an incremental batch job with Auto Loader and Delta Live Tables is dramatically simpler and cheaper; graduate to this full streaming platform only when seconds, replay, and drop-day scale genuinely demand it. And Databricks Delta Live Tables can manage the streaming pipeline declaratively if you prefer expectations and managed orchestration over hand-rolled Structured Streaming jobs; it composes with everything here.

The shape of the win

For the retailer, the payoff is not “a faster dashboard.” It is that a shopper adds the last unit of a hero SKU to their cart, and within ten seconds the merchandising service stops promoting it to everyone else and the recommendation rail re-ranks toward what is actually in stock — so the next forty minutes of paid traffic land on products people can buy, during the exact window that the last botched drop bled money. That is the sentence that funds the platform. Everything upstream — the PrivateLink boundary, the Okta-to-IAM-Identity-Center federation, the Vault-leased Kafka keys, the Unity Catalog masking, the Wiz posture scanning, the Datadog stream-lag SLO — exists to make a CDO, a CISO, and a CFO each say yes. The architecture here is the destination; start with incremental batch if you must, but a retailer that wants the site to react to shoppers in real time, at drop-day scale, under GDPR, has to land here.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
