A renewable-energy operator running 1,800 wind turbines across a dozen onshore and offshore farms gets a hard deadline from its head of asset performance: the grid operator now penalises the company for every megawatt of unplanned curtailment, and the maintenance crews are still finding gearbox failures the way they always have — when a turbine trips and a technician drives two hours to a remote site. Each turbine streams a few hundred sensor channels (gearbox oil temperature, bearing vibration spectra, blade-pitch angle, nacelle yaw, generator winding temperature) at sub-second cadence. The ask is “see every turbine in real time, predict the failures before they trip, and prove to the regulator exactly what each asset was doing the moment it derated.” The constraints are unforgiving: the data is safety- and grid-compliance-relevant, late or dropped telemetry can mask a developing fault, the platform must reconcile out-of-order events from sites with flaky satellite backhaul, and a duplicate or lost reading is not a rounding error — it corrupts the fault model that decides whether a crew gets dispatched. A nightly batch warehouse is structurally too late. This article is the reference architecture for building that platform properly on Azure: a stateful, exactly-once streaming system on Confluent Kafka and Apache Flink that an operations director, a grid-compliance officer, and a CISO will all sign.
The pressures stack the way they always do in industrial IoT. Volume means roughly half a million events per second at peak across the fleet, sustained, not a demo. Correctness means the windowed aggregates and fault scores that trigger a dispatch must be computed exactly once — double-counting a vibration spike is a false alarm and a wasted truck roll; dropping one hides a real fault. Time means events arrive out of order and late from sites on intermittent links, so “what happened in the 10:00–10:05 window” has to be answered correctly even when some of that window’s data shows up at 10:09. And latency means an operator watching a wall of turbines needs sub-second freshness, while the data-science team needs every raw event retained forever for model retraining and regulatory replay. Streaming with stateful processing is the pattern that satisfies all four at once.
Why not the obvious shortcuts
The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.
A nightly batch ETL into a warehouse is the incumbent, and it is structurally wrong here: a gearbox fault that develops over an afternoon is invisible until the next morning’s load, by which time the turbine has tripped and the grid penalty is already booked. Dumping raw telemetry straight into a time-series database and querying it scales for dashboards but collapses the moment you need stateful logic — sessionising a fault episode, joining a vibration stream against a slow-changing maintenance-history stream, or computing a five-minute windowed RMS that tolerates late data. The database becomes a place where you re-implement a stream processor badly in SQL and cron. A hand-rolled consumer service that reads the broker and does the math in application code seems tractable until you hit the three hard problems of streaming — exactly-once semantics across a failure, event-time windowing with watermarks, and large keyed state that outgrows memory — at which point you are rebuilding Flink, slowly and with bugs.
Stateful stream processing threads the needle. Kafka gives you a durable, replayable log — the single source of truth that every consumer reads at its own pace, and that you can rewind to reprocess history after a model change. Flink gives you event-time semantics with watermarks (correct windows despite late, out-of-order arrivals), large fault-tolerant keyed state (per-turbine running aggregates that survive node loss), and exactly-once guarantees end to end. The log keeps the raw truth; the processor turns it into the windowed features and fault scores that actually drive a decision.
Architecture overview
The platform runs three distinct paths that share infrastructure but live on different schedules: a low-latency hot path that serves operators and triggers alerts, a lake path that lands every raw event for retention and retraining, and a control path of schemas, secrets, and policy that governs the other two. Keeping them separate in your head is the first step to operating this well.
The defining property of the topology is the one the data-science team cares about most: the Kafka log is the system of record, and every stage downstream is a deterministic, replayable function of it. Nothing computes a fault score from a database row that someone might have mutated; everything derives from the immutable event log, which is what makes a regulatory “show me exactly what this turbine reported at 10:03” defensible.
Hot path, following the data flow:
- Each turbine’s edge gateway publishes telemetry to Azure IoT Hub (or Azure Event Grid MQTT for the pure-MQTT fleets) for device identity, per-device TLS, and bidirectional control. A lightweight bridge (Kafka Connect with the IoT Hub source, or MirrorMaker from a site-local broker) lands those events into Confluent Kafka topics partitioned by turbineId, so all of one asset’s readings stay ordered on one partition.
- Before any event is accepted, its payload is validated against a registered Avro/Protobuf schema in the Confluent Schema Registry. Producers serialise with the registry’s schema ID; a message that does not conform is rejected at the edge of the platform, not discovered three stages later as a NullPointerException. This schema contract is the seatbelt for the entire pipeline.
- Apache Flink, running as a job on AKS via the Flink Kubernetes Operator, consumes the raw topics. It assigns event-time watermarks from each reading’s device timestamp, keys the stream by turbineId, and maintains large keyed state in RocksDB: rolling vibration RMS, temperature gradients, and a per-turbine fault-detection model’s running features. Tumbling and sliding event-time windows (e.g. 1-minute and 5-minute) compute the aggregates that tolerate late arrivals up to a bounded lateness.
- Flink runs the fault logic (threshold breaches, a streaming anomaly score, and a stream-to-stream join of live vibration against a slow-changing maintenance-history stream) and emits two outputs. Enriched serving state (current status, latest aggregates, active alerts per turbine) is upserted into Azure Cosmos DB; derived alert events are written back to a Kafka alerts topic for downstream consumers.
- The operations console and the alerting service read Cosmos DB for sub-second current state, and subscribe to the alerts topic for push notifications. A confirmed fault auto-raises a work order in ServiceNow, so a crew is dispatched with a ticket and the turbine’s recent telemetry attached, not a phone call and a guess.
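The event-time behaviour described in the Flink step can be made concrete with a toy model in plain Java. This is a sketch of the idea only, not Flink's implementation: the class name, the 30-second disorder bound, and the `fired` / `tooLate` lists are assumptions chosen for illustration. It shows a watermark trailing the highest timestamp seen, a window firing once the watermark passes its end, and a too-late reading being diverted instead of silently dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy event-time windowing: 5-minute tumbling windows closed by a watermark. */
final class EventTimeWindows {
    static final long WINDOW_MS = 5 * 60_000L;       // 5-minute tumbling window
    static final long MAX_OUT_OF_ORDER_MS = 30_000L; // assumed bound on disorder

    private final TreeMap<Long, List<Double>> open = new TreeMap<>(); // window start -> values
    private long maxTimestampSeen = Long.MIN_VALUE;

    final List<String> fired = new ArrayList<>();   // "windowStart:count" for closed windows
    final List<Double> tooLate = new ArrayList<>(); // stand-in for a side output

    /** Start of the window that an event timestamp (epoch millis) falls into. */
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, WINDOW_MS);
    }

    /** Watermark = highest timestamp seen so far minus the allowed disorder. */
    long watermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - MAX_OUT_OF_ORDER_MS;
    }

    void onEvent(long eventTimeMs, double value) {
        long start = windowStart(eventTimeMs);
        if (start + WINDOW_MS <= watermark()) {
            tooLate.add(value); // its window has already fired
            return;
        }
        open.computeIfAbsent(start, k -> new ArrayList<>()).add(value);
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMs);

        // Fire every open window whose end is at or before the watermark.
        while (!open.isEmpty() && open.firstKey() + WINDOW_MS <= watermark()) {
            Map.Entry<Long, List<Double>> closed = open.pollFirstEntry();
            fired.add(closed.getKey() + ":" + closed.getValue().size());
        }
    }
}
```

A reading stamped inside the first window that arrives after a reading from the next window is still counted, as long as the watermark has not yet passed the first window's end. That is the trade the platform makes: results are held back by the disorder bound in exchange for being correct.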
Lake path, independent and continuous: a Kafka Connect sink (or a second Flink sink) writes every raw and every derived event to Azure Data Lake Storage Gen2 as partitioned Parquet (farm/date/hour), giving the data-science team an immutable, columnar history for model retraining, ad-hoc analytics in Azure Databricks / Synapse, and regulatory replay. Because the lake is fed from the same log, the offline features and the online features are computed from identical source data — the bug that quietly poisons most ML platforms is designed out here.
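As a small illustration of the farm/date/hour layout, the sketch below derives a partition path from a reading's device timestamp. The `key=value` directory naming, the UTC choice, and the class and method names are assumptions for the example; the source only specifies the farm/date/hour ordering.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Builds a farm/date/hour partition path for a lake sink (illustrative only). */
final class LakePartitioner {
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    /** Uses the device (event) timestamp in epoch millis, always rendered in UTC. */
    static String partitionPath(String farmId, long deviceTimestampMillis) {
        Instant t = Instant.ofEpochMilli(deviceTimestampMillis);
        return "farm=" + farmId
                + "/date=" + DATE.format(t)
                + "/hour=" + HOUR.format(t);
    }
}
```

Partitioning by event time rather than arrival time means a late burst from a recovering satellite link lands in the hour it describes, which keeps replay queries simple at the cost of occasionally writing into older partitions.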
Control path: the Schema Registry governs every payload’s shape and enforces compatibility on change; HashiCorp Vault issues short-lived credentials; identity flows Okta → Entra ID; and policy/posture tooling watches the whole estate. These are covered in their own sections below because they are what turns a working pipeline into a platform an enterprise will run.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Device ingress | Azure IoT Hub / Event Grid MQTT | Per-device identity, TLS, MQTT, C2D control | X.509 per-device certs; device twins; MQTT QoS 1 |
| Event log | Confluent Kafka | Durable, replayable system of record | Partition by turbineId; RF=3; tiered storage for cheap long retention |
| Schema contract | Confluent Schema Registry | Enforce payload shape + compatibility on evolution | Avro/Protobuf; BACKWARD compatibility; serializer-side validation |
| Stream processing | Apache Flink on AKS | Event-time windows, keyed state, exactly-once, joins | Flink K8s Operator; RocksDB state backend; checkpoints to ADLS |
| Serving state | Azure Cosmos DB | Sub-second current status + aggregates per turbine | Partition by turbineId; autoscale RU/s; upsert from Flink sink |
| Data lake | ADLS Gen2 | Immutable raw + derived history (Parquet) | farm/date/hour partitioning; lifecycle tiering to cool/archive |
| Batch / ML | Databricks / Synapse | Retraining, ad-hoc analytics, regulatory replay | Reads ADLS; feature parity with the streaming path |
| Identity / SSO | Okta + Microsoft Entra ID | Workforce SSO (Okta) federated to Entra for native Azure RBAC | OIDC federation; group claims; conditional access |
| Secrets | HashiCorp Vault | Kafka/Schema-Registry creds, sink keys, signing material | Entra auth method; dynamic short-lived leases; Agent sidecar injection |
| Edge / CDN | Akamai | TLS, WAF, anycast for the operations console + APIs | WAF rules; origin shield to the private ingress |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, sensitive-data exposure, IaC scanning | Agentless scan of ADLS/Cosmos; Wiz Code gate in CI on Terraform |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on AKS nodes + connector VMs | Sensor on node pools; detections piped to the SOC |
| Observability | Dynatrace | Distributed tracing, consumer-lag + checkpoint telemetry | OneAgent on AKS; OTel spans; Davis anomaly detection on lag |
| ITSM / approvals | ServiceNow | Auto work orders on fault; schema-change approvals | Change gate on schema/topic changes; auto-ticket on confirmed fault |
| CI / IaC | GitHub Actions + Argo CD + Terraform/Ansible | Build/test/deploy Flink jobs; provision infra; config edge fleet | OIDC to Azure; Argo CD GitOps for AKS; Ansible for connector hosts |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why Kafka and Flink, not one or the other. Kafka is a log, not a processor: it stores and replays events durably but does not, by itself, compute a windowed RMS that tolerates late data or hold per-turbine state across a node failure. Flink is a processor, not a store: it computes brilliantly but needs a durable, replayable source to be correct after a crash. Together, Kafka is the source of truth and the buffer that absorbs a Flink restart; Flink is the brain that turns the log into decisions. Kafka Streams or ksqlDB can do lighter in-broker processing and are simpler to operate — but they top out before the large keyed state, sophisticated event-time semantics, and the broker-agnostic exactly-once joins this fleet needs, which is the line where you graduate to Flink.
Why partition by turbineId. Kafka guarantees ordering only within a partition. Keying by turbineId keeps every reading from one asset on one partition and therefore in order, which is exactly what a per-turbine fault model requires, and it lines the topic up with Flink’s keyed state and the Cosmos partition key so that every stage groups data by the same identifier. The tradeoff is skew: a few turbines that emit far more than the rest, or a key hash spread across too few partitions, produce hot partitions. The answer is enough partitions to spread load and a partition count chosen for your peak consumer parallelism, not your current one.
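The key-to-partition mapping can be sketched in a few lines. Note the assumption: Kafka's default partitioner hashes the serialised key with murmur2, whereas this example substitutes `Arrays.hashCode` purely to stay dependency-free, so the partition numbers it yields will not match a real cluster. The turbine IDs are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrates keyed partitioning: same key, same partition, every time. */
final class TurbinePartitioner {
    /** Stand-in hash; a real Kafka producer uses murmur2 over the key bytes. */
    static int partitionFor(String turbineId, int numPartitions) {
        byte[] keyBytes = turbineId.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    /** Counts how many of the given keys land on each partition, to eyeball skew. */
    static int[] loadPerPartition(List<String> turbineIds, int numPartitions) {
        int[] load = new int[numPartitions];
        for (String id : turbineIds) {
            load[partitionFor(id, numPartitions)]++;
        }
        return load;
    }
}
```

Because the mapping depends on the partition count, adding partitions later moves existing keys to different partitions and breaks per-turbine ordering across the change. That is the practical reason to size for peak up front.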
Why exactly-once is non-negotiable here. At-least-once delivery is fine for a dashboard that can tolerate a flicker; it is wrong for a fault score that dispatches a truck. Flink achieves end-to-end exactly-once by combining checkpoint barriers with sinks that are either transactional or idempotent. It snapshots operator state and source offsets together to ADLS; writes to the Kafka alerts topic go through the two-phase-commit sink protocol and are committed only when the checkpoint completes; writes to Cosmos DB are upserts keyed by turbineId, so a replay after a failure overwrites the same document rather than adding a duplicate. A mid-stream failure therefore rewinds to the last consistent snapshot with no double-count and no gap. The cost is checkpoint overhead and a small latency floor, covered honestly in tradeoffs.
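The recovery argument is easier to trust once it is traced by hand. The sketch below is a single-threaded stand-in, not Flink's barrier protocol: a list index plays the role of the Kafka offset, a `HashMap` plays the role of keyed state, and another map plays the role of an upsert-only serving store. All names are hypothetical. The point it demonstrates is that snapshotting the offset and the state together, then replaying into an idempotent sink, gives the same final answer as a run that never crashed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of checkpoint-and-replay with an idempotent (upsert) sink. */
final class CheckpointReplay {
    /** Source position and operator state, captured at the same instant. */
    static final class Checkpoint {
        final int nextOffset;
        final Map<String, Long> counts;

        Checkpoint(int nextOffset, Map<String, Long> counts) {
            this.nextOffset = nextOffset;
            this.counts = new HashMap<>(counts); // defensive copy
        }
    }

    private int nextOffset = 0;
    private Map<String, Long> counts = new HashMap<>();
    final Map<String, Long> servingStore = new HashMap<>(); // upsert target

    Checkpoint snapshot() {
        return new Checkpoint(nextOffset, counts);
    }

    /** Simulates restart after a crash: state and offset rewind together. */
    void restore(Checkpoint cp) {
        nextOffset = cp.nextOffset;
        counts = new HashMap<>(cp.counts);
    }

    /** Processes log entries from the current offset up to (not including) end. */
    void processUntil(List<String> log, int end) {
        while (nextOffset < end) {
            String turbineId = log.get(nextOffset);
            long runningCount = counts.merge(turbineId, 1L, Long::sum);
            servingStore.put(turbineId, runningCount); // replay overwrites, never appends
            nextOffset++;
        }
    }
}
```

If the state were restored without rewinding the offset, or the offset rewound without restoring the state, the counts would come out low or high respectively. Capturing both in one snapshot is what removes the gap and the double-count.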
Implementation guidance
Provision with Terraform, configure the fleet with Ansible, and treat the schema as a first-class deliverable. The order matters: a topic or sink that goes live before its schema is registered is a data-quality incident waiting to happen.
- The network: a hub/spoke or single VNet with subnets for AKS, the connector hosts, and private endpoints; Private Endpoints for ADLS, Cosmos DB, and Key Vault with public network access disabled, plus the matching privatelink private DNS zones (privatelink.blob.core.windows.net and privatelink.dfs.core.windows.net for the storage account, privatelink.documents.azure.com, privatelink.vaultcore.azure.net). The dfs zone matters because ADLS Gen2 clients, including Flink’s abfss checkpoint path, resolve the dfs endpoint. Forgetting a DNS zone link is the classic silent-hang failure on Azure private networking.
- Confluent Kafka (Confluent Cloud on Azure for a managed control plane, or Confluent Platform on AKS for full control) with topics, partition counts, replication factor 3, and tiered storage so long retention spills to object storage cheaply.
- The AKS cluster (Azure CNI, Workload Identity on) with the Flink Kubernetes Operator installed; Flink jobs deployed declaratively and reconciled by Argo CD from the GitOps repo.
- Cosmos DB and ADLS Gen2, each with public_network_access disabled and a Private Endpoint.
- Ansible configures the edge gateway fleet and the connector VMs (MQTT endpoints, certificates, agent installs) idempotently across hundreds of remote sites.
A minimal Flink job shape communicates the intent — event-time, keyed state, exactly-once:
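```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// env, kafkaSource, cosmosUpsertSink, Reading, VibrationRmsAggregate and
// FaultScorer are defined elsewhere in the job.

// Read raw telemetry, assign event-time watermarks tolerant of late data
DataStream<Reading> readings = env
    .fromSource(kafkaSource, WatermarkStrategy
            .<Reading>forBoundedOutOfOrderness(Duration.ofSeconds(30))
            .withTimestampAssigner((r, ts) -> r.deviceTimestampMillis()),
        "turbine-telemetry");

// Per-turbine, 5-minute sliding window of vibration RMS, exactly-once to Cosmos + alerts
readings
    .keyBy(Reading::getTurbineId)
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .aggregate(new VibrationRmsAggregate())   // keyed state in RocksDB
    .process(new FaultScorer())               // emits status + alerts
    .sinkTo(cosmosUpsertSink);                // transactional or idempotent sink, for exactly-once
```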
And the checkpoint configuration that underpins the guarantee:
```yaml
# flink-conf: exactly-once with durable, incremental checkpoints to ADLS
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.interval: 30s
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: abfss://checkpoints@stwindturbine.dfs.core.windows.net/flink
```
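The job shape above names a VibrationRmsAggregate without showing it. The sketch below is one plausible form of its accumulator logic in plain Java, with no Flink dependency; the class name and method names are assumptions. In a real job this logic would sit behind Flink's `AggregateFunction` interface, whose `createAccumulator`, `add`, `getResult` and `merge` methods map onto the operations here.

```java
/** Incremental RMS: keeps only a running sum of squares and a count. */
final class RmsAccumulator {
    private double sumOfSquares;
    private long count;

    /** Folds one vibration sample into the running totals. */
    void add(double sample) {
        sumOfSquares += sample * sample;
        count++;
    }

    /** Combines two partial accumulators into a new one; neither input is modified. */
    RmsAccumulator merge(RmsAccumulator other) {
        RmsAccumulator merged = new RmsAccumulator();
        merged.sumOfSquares = this.sumOfSquares + other.sumOfSquares;
        merged.count = this.count + other.count;
        return merged;
    }

    /** Root mean square of all samples seen; 0.0 for an empty window. */
    double rms() {
        return count == 0 ? 0.0 : Math.sqrt(sumOfSquares / count);
    }
}
```

Keeping two numbers per window instead of every sample is what makes the keyed state small enough to hold for a whole fleet, and the merge step is what lets partial results be combined without revisiting the raw readings.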
The pipeline that builds and deploys these jobs runs in GitHub Actions, authenticating to Azure via OIDC federation so there is no stored service-principal secret to leak — a hard lesson this platform team intends never to repeat — with Argo CD doing the actual GitOps rollout onto AKS and Terraform/Ansible managing infrastructure and fleet config respectively. Wiz Code scans the Terraform in the same pipeline and blocks a merge that would, say, open a storage account to the public internet.
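Schema evolution without breaking the fleet. Register every payload in the Schema Registry with BACKWARD compatibility, which guarantees that a consumer on the new schema version can still read data written with the previous one; that is the property replay from the log depends on. (BACKWARD_TRANSITIVE extends the check to every earlier version and is the safer setting when history is replayed across many versions.) In practice, new fields must be optional with defaults, and consumers are upgraded before producers. A change that would break compatibility is rejected at CI time, not at 2 a.m. in production. A schema change is also a governed event: it passes a ServiceNow change approval before the new version is promoted, giving compliance a documented record of every shape the telemetry has ever taken, which matters when a regulator asks what vibration_rms meant in last March’s data.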
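The rule a backward-compatibility gate enforces can be reduced to one check, sketched below. This is a deliberately simplified model, not the Schema Registry's API: a schema is represented as a map from field name to "has a default", type changes are ignored, and the field names in the usage note are hypothetical. It captures the core idea that a reader on the new schema can decode old data only if every field it adds has a default to fall back on.

```java
import java.util.Map;

/** Simplified BACKWARD check: can a reader on newSchema decode data written with oldSchema? */
final class BackwardCompatibility {
    /**
     * Each map goes from field name to whether that field declares a default value.
     * Removing a field is allowed; adding one is allowed only if it has a default.
     */
    static boolean canReadOldData(Map<String, Boolean> oldSchema,
                                  Map<String, Boolean> newSchema) {
        for (Map.Entry<String, Boolean> field : newSchema.entrySet()) {
            boolean missingInOld = !oldSchema.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (missingInOld && !hasDefault) {
                return false; // the reader would have nothing to fill this field with
            }
        }
        return true;
    }
}
```

Adding an optional `oil_temp_c` field with a default passes this check, while adding a required `site_code` field fails it. Running an equivalent check in CI is what moves the failure from production to the pull request.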
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: identity-based access, least-privilege RBAC scoped per resource, no public data-plane surface on ADLS, Cosmos, or Key Vault. Layer on top: (a) human SSO flows Okta → Entra ID: operators and engineers log in once with the company’s Okta credentials and conditional-access policies, Okta federates to Entra over OIDC, and the Entra token carries the group claims that gate the operations console and the Azure RBAC roles; (b) device identity is X.509 per-turbine certificates in IoT Hub, so a compromised gateway can be revoked individually without touching the fleet; (c) the few secrets that are not managed identities (Kafka API keys, Schema Registry credentials, third-party sink tokens) live in HashiCorp Vault, leased dynamically and injected by the Vault Agent sidecar, never written to a Kubernetes Secret; (d) Wiz runs continuous CSPM and sensitive-data-exposure scanning across ADLS and Cosmos, alerting the moment a resource drifts to public exposure, with Wiz Code catching the same class of mistake earlier in IaC; (e) CrowdStrike Falcon sensors on the AKS node pools and connector VMs provide runtime threat detection feeding the SOC; (f) Akamai fronts the operations console and public APIs with TLS, WAF, and bot mitigation. Azure Policy denies any storage or Cosmos resource created with public network access, and Wiz independently verifies the policy is actually holding.
Cost optimization. Throughput-driven spend dominates and grows with the fleet, so engineer for it from day one.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Kafka tiered storage | Spill long retention to object storage, keep hot data on brokers | Cuts broker disk cost dramatically for long replay windows |
| ADLS lifecycle tiering | Auto-move Parquet hot→cool→archive by age | Slashes the cost of “retain everything forever” |
| Cosmos autoscale RU/s | Scale request units to actual load, floor low overnight | Avoids paying peak RU/s 24×7 |
| Right-sized partitions | Match Kafka partitions + Flink parallelism to peak, not vanity | Avoids over-provisioned, idle compute |
| Edge pre-aggregation | Down-sample low-value channels at the gateway | Trims ingress volume before it costs anything downstream |
Meter ingress volume and Cosmos RU consumption per farm and pipe the metrics to Dynatrace, which the platform team uses for the per-site cost dashboard the CFO sees.
Scalability. Each tier scales independently. Kafka scales by adding brokers and partitions (size partition count for peak consumer parallelism, since you cannot easily reduce it later). Flink scales by parallelism — more task slots and AKS nodes — and, because keyed state is partitioned by key, it rescales by redistributing key groups; large state lives in RocksDB on local SSD, not heap, so a turbine fleet’s worth of running aggregates does not blow the JVM. Cosmos DB scales out by physical partition on turbineId and autoscales RU/s. ADLS is effectively unbounded. The natural ceiling is Flink’s stateful rescaling, which is why partition and key-group counts are chosen with the target fleet size in mind, not today’s.
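Rescaling by key group is easier to reason about with the arithmetic written out. The sketch below follows the scheme as I understand it from Flink's design: a key is hashed into one of `maxParallelism` key groups, and contiguous ranges of key groups are assigned to operator instances. The hash here is a plain `hashCode` stand-in (Flink applies an additional murmur hash), and the class and method names are assumptions, so treat the specific numbers as illustrative.

```java
/** Toy model of key groups: keys map to groups, groups map to operator instances. */
final class KeyGroups {
    /** A key's group never changes for a fixed maxParallelism. Stand-in hash only. */
    static int keyGroupFor(String key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Contiguous ranges of key groups are handed to each parallel instance. */
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }
}
```

With 128 key groups, group 32 sits on instance 1 at parallelism 4 and moves to instance 2 at parallelism 8, while the turbine's key group itself is unchanged. Whole groups move between instances on rescale, which is why the maximum parallelism caps how far the job can ever scale out and should be set for the target fleet.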
Failure modes, and what each one looks like. Name them before they page you.
- A flapping site link — telemetry arrives in a late burst when a satellite link recovers. Mitigation: event-time watermarks with a bounded out-of-orderness, plus a side-output for the truly-too-late events so they are captured to the lake rather than silently dropped from a window.
- A Flink job crash — a node dies mid-window. Mitigation: this is the designed-for case — the job restarts from the last checkpoint, rewinds Kafka offsets, and the two-phase-commit sinks ensure no double-count and no gap. The only visible effect is a brief processing-latency bump.
- Consumer lag blowout — a downstream consumer or sink slows and lag climbs. Mitigation: alert on lag in Dynatrace (the single most important streaming SLI), autoscale Flink parallelism, and rely on Kafka’s durable log to buffer until the consumer catches up.
- A poison message — a malformed or schema-violating payload. Mitigation: the Schema Registry rejects non-conforming payloads at ingest, and Flink routes any record that still fails processing to a dead-letter topic rather than crash-looping the job.
- State corruption or a bad deploy — a buggy Flink version computes wrong scores. Mitigation: because Kafka is replayable, you fix the job and reprocess from a retained offset to rebuild correct state — the superpower of a log-centric design.
- Regional outage — see DR below.
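The poison-message mitigation in the list above comes down to one discipline: a bad record must never throw out of the processing loop. A minimal sketch, assuming a hypothetical `turbineId,vibration` text payload and in-memory lists standing in for the scoring path and the dead-letter topic:

```java
import java.util.ArrayList;
import java.util.List;

/** Routes unparseable records to a dead-letter list instead of failing the job. */
final class DeadLetterRouter {
    final List<String> scored = new ArrayList<>();      // stand-in for the normal path
    final List<String> deadLetters = new ArrayList<>(); // stand-in for a dead-letter topic

    /** Expects "turbineId,vibration"; anything else is captured with the reason. */
    void process(String raw) {
        try {
            String[] parts = raw.split(",");
            if (parts.length != 2 || parts[0].trim().isEmpty()) {
                throw new IllegalArgumentException("bad shape");
            }
            double vibration = Double.parseDouble(parts[1].trim());
            if (Double.isNaN(vibration) || vibration < 0) {
                throw new IllegalArgumentException("bad value");
            }
            scored.add(parts[0].trim() + "=" + vibration);
        } catch (IllegalArgumentException e) {
            // NumberFormatException is a subclass, so parse failures land here too.
            deadLetters.add(raw + " | " + e.getMessage());
        }
    }
}
```

Keeping the original payload and the failure reason together in the dead-letter record is what makes the bad message debuggable later, and counting those records gives the dead-letter metric that should stay near zero once schema validation is working at ingest.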
Reliability & DR (RTO/RPO). Decide the numbers per tier. Kafka replication factor 3 across availability zones gives broker fault tolerance; cross-region DR uses Cluster Linking / MirrorMaker 2 to mirror topics to a paired region. Flink recovers from checkpoints in ADLS (geo-redundant), so a regional failover restarts jobs against the mirrored cluster from the last durable checkpoint. Cosmos DB with multi-region writes gives near-zero RPO and seconds RTO for serving state. ADLS geo-redundant storage is the durable backstop — and because every derived dataset is a replayable function of the raw log, the ultimate recovery guarantee is “rebuild from the events.” A pragmatic target for this platform: RTO 15 minutes, RPO under 1 minute for the hot path, with the lake and all derived state rebuildable from geo-redundant raw events if needed. Akamai health checks drive edge failover for the console.
Observability. This platform lives or dies on streaming-specific SLIs, so instrument them first in Dynatrace with OneAgent on AKS and OpenTelemetry spans across the Flink job: consumer lag per topic/partition (the canary for everything), end-to-end event latency (device timestamp → Cosmos upsert), checkpoint duration and failure rate (a checkpoint that starts timing out predicts a job in trouble), records-in/out and backpressure per operator, and dead-letter / too-late event counts. Davis anomaly detection on lag and checkpoint duration surfaces a degrading pipeline before it falls over. Emit the business metrics too — active faults per farm, mean time from fault onset to work order, and ingress volume per site — because those are what the asset-performance director judges the platform on. Datadog is an equally valid choice if the wider estate already standardises on it; the SLIs are the same.
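Consumer lag, the SLI named first above, is simple arithmetic: the log end offset minus the consumer group's committed offset, per partition. The sketch below computes it from two offset maps; how those maps are obtained (Kafka admin APIs, an exporter, or the monitoring agent) is outside the example, and treating a partition with no committed offset as lagging from zero is an assumption made here for simplicity.

```java
import java.util.Map;
import java.util.TreeMap;

/** Per-partition consumer lag from end offsets and committed offsets. */
final class ConsumerLag {
    /** Lag = log end offset minus committed offset, floored at zero. */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> endOffsets,
                                              Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> end : endOffsets.entrySet()) {
            // Assumption: no committed offset yet means the whole partition is unread.
            long committed = committedOffsets.getOrDefault(end.getKey(), 0L);
            lag.put(end.getKey(), Math.max(0L, end.getValue() - committed));
        }
        return lag;
    }

    /** Sum of lag across all partitions of a topic. */
    static long totalLag(Map<Integer, Long> lagPerPartition) {
        long total = 0L;
        for (long partitionLag : lagPerPartition.values()) {
            total += partitionLag;
        }
        return total;
    }
}
```

Alerting on the per-partition values rather than only the total is worth the extra series: a single stuck partition can hide inside a healthy-looking sum while one slice of the fleet goes stale.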
Governance. Pin schema versions explicitly and promote them through the compatibility gate; never let a producer serialise against an unregistered schema. Keep Flink job code, topic definitions, and Terraform in version control, reviewed and revertable, with Argo CD making the deployed state an auditable reflection of Git. Apply Azure Policy to deny public network access and require diagnostic settings on every relevant resource, with Wiz as the independent check that the controls are real. Retain the raw event log for the regulator’s required window, and treat the Schema Registry history plus the immutable ADLS lake as the audit record of exactly what every asset reported and meant — the artefact that turns a grid-compliance query from a panic into a query.
Explicit tradeoffs
Accept these or do not build it. Stateful streaming is genuinely harder to operate than a batch warehouse: you are now running a distributed stateful processor (Flink), a distributed log (Kafka), a schema registry, and the discipline to keep them coherent — that is real platform-team headcount, not a cron job. Exactly-once is not free: checkpoint barriers and two-phase commit add overhead and a latency floor, so if your use case genuinely tolerates at-least-once (a pure dashboard), you are paying for a guarantee you do not need. Event-time correctness costs latency by design: to handle late data you hold windows open for a bounded lateness, which trades immediacy for correctness — tune that bound to your worst site link, and accept that “correct in five minutes” beats “wrong now” for a dispatch decision but not for a raw live gauge. And the operational surface is wide — Kafka partitions, Flink parallelism and state, private DNS, schema compatibility — where the price of forgetting one piece (an unlinked DNS zone, an under-partitioned topic) is a silent hang or a non-rescalable bottleneck rather than a clear error.
The alternatives, and when they win. If your telemetry is low-volume and you only need dashboards and simple thresholds, Azure Stream Analytics or a managed time-series database (Azure Data Explorer) is far simpler and a perfectly good answer — graduate to Kafka + Flink when you need large keyed state, exactly-once joins, or replayable reprocessing. If your processing is light and lives entirely inside the broker, Kafka Streams or ksqlDB avoids standing up a separate Flink cluster. If you are all-in on Databricks, Spark Structured Streaming covers much of this with one fewer system to run, at the cost of Flink’s lower-latency, finer-grained state model. The architecture here is the destination for a high-volume, correctness-critical, replay-required industrial IoT platform; start narrower if your volume and correctness needs are modest, but this is where a fleet-scale predictive-maintenance and compliance platform has to land.
The shape of the win
For the wind operator, the payoff is not “a real-time dashboard.” It is that a gearbox bearing’s vibration signature crosses a learned threshold at 10:03, a fault score updates in Cosmos within a second, an alert fires and a ServiceNow work order is raised automatically with the turbine’s recent telemetry attached, and a crew is dispatched before the turbine trips — so the unplanned curtailment, and the grid penalty that comes with it, never happens. And when the regulator later asks what that asset was doing the moment it derated, the answer is an exact replay from the immutable log, schema-versioned and auditable. That is the sentence that funds the platform. Everything upstream — the Confluent log as system of record, Flink’s exactly-once event-time state, the Schema Registry contract, the Okta-to-Entra identity, the Vault-held secrets, the Wiz posture scanning, the CrowdStrike runtime sensors, the Dynatrace lag and checkpoint SLIs — exists to make an operations director, a grid-compliance officer, and a CISO each say yes. The architecture here is the destination; start narrower if you must, but this is where correctness-critical IoT telemetry at fleet scale has to land.