A tier-one automotive parts supplier — the kind that stamps and welds body panels for three of the big OEMs — gets a number from its plant director that changes the conversation: an unplanned stoppage on the main stamping line costs roughly ₹18 lakh an hour in idle labour, missed just-in-sequence deliveries, and OEM line-down penalties, and last quarter the line went down eleven times for a hydraulic press fault that, in hindsight, had been telegraphing itself in the vibration data for days. The board’s ask is blunt: “see the failures before they happen, and never let a machine on the floor be the thing that gets us breached.” The constraints are the ones every factory carries. The plant network is air-gapped from corporate IT by an OT/IT boundary the security team will not soften. The presses, robots, and PLCs speak OPC UA and Modbus, not HTTPS. Connectivity to Azure is a single, sometimes flaky, site link — the cloud cannot be in the hot path of a safety decision. And the CISO has read enough about factory ransomware to veto anything that puts an unmanaged Linux box on the OT VLAN. This article is the reference architecture for an IoT edge-to-cloud platform that earns predictive maintenance and OEE analytics without compromising the plant floor — the kind of design a plant director and a CISO will both sign.
The pressures stack the way they always do in manufacturing. Uptime means a model that flags a bearing two days out, computed where the data is born, not in a datacentre 600 km away. Latency and autonomy mean the factory keeps running — and keeps inferring — when the WAN link drops, because a stamping line cannot pause for a cloud round-trip. Data gravity means a single press generates more high-frequency telemetry in an hour than you would ever ship raw over a site link, so you decide at the edge what is worth sending. And security means treating every device on the floor as a potential entry point, because in OT it is. The edge-to-cloud pattern satisfies all four: heavy, low-latency work — protocol translation, ML inference, buffering — runs on a hardened gateway on the plant floor, while the cloud does what only the cloud can do well — model training, a plant-wide digital twin, long-horizon analytics, and fleet management across sites.
Why not the obvious shortcuts
The naive fixes each fail predictably on a factory floor, and naming why matters because someone on the project will propose all three.
“Just stream every sensor straight to the cloud” ignores data gravity and the WAN. Even a single vibration channel sampled at 10 kHz on each of forty machines is well over 100 GB of raw samples a day, and terabytes once every machine carries several channels; the site link cannot carry it, the cloud ingest bill is absurd, and the moment the link blips you are blind. “Run the ML model in the cloud and call it from the floor” puts a wide-area network in the path of a time-sensitive decision: when the link is up you pay round-trip latency, and when it is down the line has no anomaly detection at all, which is exactly when a struggling press most needs watching. “Put a PC on the OT VLAN with a Python script” is how factories get ransomware: an unmanaged, unpatched, unmonitored box bridging the plant network to the internet is the attack path every OT security report warns about.
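The data-gravity argument is easy to check on the back of an envelope. The sketch below assumes uncompressed float32 samples and treats the channel count per machine as a free parameter; both are illustrative assumptions, not figures from the plant.

```python
# Back-of-envelope raw telemetry volume. All inputs are illustrative assumptions.
SAMPLE_RATE_HZ = 10_000      # 10 kHz vibration sampling, as in the text
BYTES_PER_SAMPLE = 4         # assumption: one uncompressed float32 per sample
MACHINES = 40
SECONDS_PER_DAY = 86_400


def raw_bytes_per_day(channels_per_machine: int) -> int:
    """Uncompressed bytes per day for the whole floor."""
    return (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_DAY
            * MACHINES * channels_per_machine)


def required_mbit_per_s(bytes_per_day: int) -> float:
    """Sustained link rate needed to ship that volume, with no headroom."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


for channels in (1, 8):
    volume = raw_bytes_per_day(channels)
    print(f"{channels} channel(s)/machine: {volume / 1e9:.0f} GB/day, "
          f"{required_mbit_per_s(volume):.1f} Mbit/s sustained")
```

Under these assumptions one channel per machine is roughly 138 GB a day and a constant 12.8 Mbit/s; eight channels per machine crosses a terabyte a day and needs about 100 Mbit/s around the clock, before protocol overhead.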
The edge-to-cloud split threads the needle. A managed Azure IoT Edge gateway sits at the OT/IT boundary, speaks the machines’ native protocols, runs the anomaly model locally so inference survives a WAN outage, buffers telemetry when the link drops and back-fills when it returns, and sends only the distilled signals worth keeping — features, aggregates, anomaly events — north to Azure. The cloud trains the models the edge runs, maintains a live digital twin of the whole plant, and gives the reliability team plant-wide and cross-site analytics. The floor stays autonomous; the cloud stays authoritative.
Architecture overview
The platform runs two distinct paths that share infrastructure but live on opposite sides of the OT/IT boundary and on different schedules: a near-real-time uplink path that carries floor telemetry and events into Azure, and a control / deployment path that pushes models, modules, and configuration back down to the gateways. Keeping them separate in your head is the first step to operating this well.
The defining property of the topology is the one the security team cares about most: the gateway is the only thing that crosses the OT/IT boundary, it initiates all connections outbound, and nothing in Azure can dial into the plant network. The PLCs and robots talk only to the local gateway over the OT VLAN; the gateway talks only to Azure IoT Hub over an outbound TLS connection, either MQTT on 8883 or MQTT/AMQP over WebSockets on 443. No inbound port is ever opened toward the floor, which is what makes the OT story defensible to a CISO.
Uplink path, following the data flow:
- Machines on the floor — stamping presses, weld robots, conveyors — expose data over OPC UA and Modbus. An OPC Publisher module on the Azure IoT Edge gateway subscribes to the tags that matter (press tonnage, ram velocity, hydraulic pressure, motor current, bearing vibration) and normalizes them into a common schema.
- A custom stream-processing module on the gateway windows and featurizes the high-frequency signals locally — RMS vibration, kurtosis, temperature slope — so the terabytes stay on the floor and only kilobytes of features move.
- An on-prem ML inference module (an ONNX anomaly-detection / remaining-useful-life model running in the IoT Edge runtime) scores those features locally, every cycle, with no cloud dependency. A score crossing threshold raises a local anomaly event immediately — the line can react before the cloud has even heard about it.
- The IoT Edge runtime routes messages to Azure IoT Hub over an outbound connection. When the WAN link drops, the runtime stores messages locally and forwards them when connectivity returns, so an outage shorter than the configured time-to-live delays history instead of losing it.
- IoT Hub ingests device-to-cloud messages and, via message routing, fans them out: telemetry to analytics storage, and critical anomaly events to Azure Event Grid, which pushes them to subscribers — an Azure Function that opens a ServiceNow work order, a Logic App that pages the on-call reliability engineer, and the digital-twin updater.
- Azure Digital Twins holds a live graph model of the plant — every press, robot, cell, and line as a twin with current state — updated from the telemetry and anomaly stream so a single API answers “what is the health of Line 3 right now.”
- Telemetry lands in Azure Data Explorer (the migration target Microsoft recommends for the now-retired Time Series Insights) for high-volume time-series storage, ad-hoc KQL exploration, and the OEE / trend dashboards the reliability team lives in. A copy lands in a lake for model retraining.
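The featurization step in the uplink path can be sketched in a few lines. This is a minimal, standard-library illustration of the three features named above (RMS, kurtosis, temperature slope); the function names and the output field names are hypothetical, and a production module would more likely use a vectorized library.

```python
import math
from statistics import fmean


def rms(window):
    """Root-mean-square amplitude of one window of samples."""
    return math.sqrt(fmean([x * x for x in window]))


def kurtosis(window):
    """Pearson (non-excess) kurtosis: about 3 for Gaussian noise, higher when impulsive."""
    mean = fmean(window)
    m2 = fmean([(x - mean) ** 2 for x in window])
    m4 = fmean([(x - mean) ** 4 for x in window])
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def slope(values, dt_s):
    """Least-squares slope of evenly spaced readings, in units per second."""
    n = len(values)
    t_mean = (n - 1) * dt_s / 2
    v_mean = fmean(values)
    num = sum((i * dt_s - t_mean) * (v - v_mean) for i, v in enumerate(values))
    den = sum((i * dt_s - t_mean) ** 2 for i in range(n))
    return num / den if den else 0.0


def featurize(vibration, temperature, temp_dt_s):
    """Collapse one window of raw samples into a small feature record."""
    return {
        "vib_rms": rms(vibration),
        "vib_kurtosis": kurtosis(vibration),
        "temp_slope_c_per_s": slope(temperature, temp_dt_s),
    }


# One second of a synthetic 50 Hz sine at 10 kHz, plus ten temperature readings.
vibration = [math.sin(2 * math.pi * 50 * i / 10_000) for i in range(10_000)]
temperature = [40.0 + 0.5 * i for i in range(10)]
features = featurize(vibration, temperature, temp_dt_s=1.0)
print(features)
```

Ten thousand raw samples go in and three numbers come out, which is the whole point: the waveform stays on the gateway and only the feature record is routed upstream.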
Control / deployment path, independent and outbound-pull: the gateway’s module set, the inference model, and per-machine configuration are described as an IoT Edge deployment manifest. The gateway pulls its assigned modules and the latest model from Azure Container Registry; a new model trained in the cloud is promoted by updating the manifest, and every gateway in the fleet converges to it on its own schedule. Nothing is ever pushed into the plant network from outside — the device always initiates.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge gateway | Azure IoT Edge | Protocol translation, local featurization, local inference, store-and-forward | Hardened Linux host on OT/IT boundary; modules from ACR; outbound-only |
| Protocol ingest | OPC Publisher + Modbus module | OPC UA / Modbus → normalized message schema | Subscribe only to needed tags; sampling/publishing intervals tuned per machine |
| On-prem inference | ONNX model in IoT Edge | Anomaly / remaining-useful-life scoring at the source, offline-capable | Versioned model image; CPU or local GPU; runs every cycle |
| Cloud ingest | Azure IoT Hub | Device identity, secure D2C/C2D, message routing | Per-device X.509 auth; routes split telemetry vs. events; DPS for provisioning |
| Event fan-out | Azure Event Grid | Push critical events to subscribers (ITSM, paging, twin) | Event-type filtering; dead-letter to Storage |
| Plant model | Azure Digital Twins | Live graph of machines/cells/lines and their state | DTDL models per asset type; updated from telemetry + anomalies |
| Time-series analytics | Azure Data Explorer (successor to the retired Time Series Insights) | High-volume telemetry store, KQL, OEE & trend dashboards | Hot/cold retention; KQL functions for OEE; Grafana/dashboards |
| Device security | CrowdStrike Falcon | Runtime threat detection on edge gateways (and OT-aware monitoring) | Falcon sensor on every gateway; detections to the SOC; USB/exec controls |
| Identity / SSO | Entra ID + Okta | Engineer/operator SSO to dashboards and fleet tooling | Okta workforce IdP federated to Entra; conditional access; RBAC on twins/ADX |
| Secrets | HashiCorp Vault | Provisioning keys, signing keys, third-party API tokens | Short-lived leases; gateways fetch enrolment secrets, never hard-coded |
| CSPM / posture | Wiz (+ Wiz Code) | Cloud posture; IoT misconfig & exposure; IaC scanning pre-merge | Agentless scan of Hub/Storage/ADX; Wiz Code gates Terraform PRs |
| Observability | Dynatrace / Datadog | Health of the cloud pipeline and the edge fleet; anomaly detection | Pipeline tracing; per-gateway module metrics; alert on offline devices |
| ITSM / work orders | ServiceNow | Auto-raise maintenance work orders and incidents from events | Event Grid → Function → ServiceNow create; CMDB sync of assets |
| Content / TLS edge | Akamai | TLS, WAF, anycast for the operator/fleet web portals | WAF on the management portal origin; bot mitigation |
| CI / IaC | GitHub Actions / Jenkins + Argo CD; Terraform / Ansible | Build module images, ship manifests, provision cloud + harden gateways | OIDC to Azure; Argo CD GitOps for edge manifests; Ansible bakes the gateway image |
| Training / upskilling | Moodle | Operator & technician training on the new tooling and runbooks | SSO via Okta; courses gated before floor access to the dashboards |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why inference runs on the edge, not the cloud. The whole value proposition — catch a fault before it stops the line — depends on scoring every machine cycle with sub-second latency and during a WAN outage. Putting the model in Azure means each inference pays a wide-area round-trip, and the moment the site link blips you have zero anomaly detection precisely when a struggling press needs it most. Running an ONNX model inside the IoT Edge runtime makes inference a local, deterministic, offline-capable operation; the cloud’s job is to train the model and ship a better one, not to answer per-cycle questions.
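A rough sketch of what "local, deterministic, offline-capable" means in code. The ONNX session call is replaced here by an injected `score_fn`, and the threshold, debounce count, and toy scorer are all hypothetical; the point is only that nothing in the loop touches the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LocalAnomalyDetector:
    score_fn: Callable[[Dict[str, float]], float]  # stand-in for the ONNX session call
    threshold: float = 0.8        # hypothetical alarm threshold
    consecutive_needed: int = 3   # debounce: windows over threshold before alarming
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, features: Dict[str, float]) -> Optional[dict]:
        """Score one feature window locally; return an anomaly event or None."""
        score = self.score_fn(features)
        self._streak = self._streak + 1 if score >= self.threshold else 0
        if self._streak == self.consecutive_needed:
            # Fire once per excursion; the streak resets when the score recovers.
            return {"type": "anomaly", "score": score, "features": features}
        return None


def toy_score(features: Dict[str, float]) -> float:
    """Toy scorer, NOT a trained model: maps vibration RMS onto 0..1."""
    return min(1.0, features["vib_rms"] / 2.0)


detector = LocalAnomalyDetector(score_fn=toy_score)
readings = [0.5, 1.7, 1.8, 1.9, 2.0, 0.4]
events = [detector.observe({"vib_rms": r}) for r in readings]
print([e is not None for e in events])
```

The debounce is a design choice rather than a requirement: requiring a few consecutive windows over threshold trades a little detection delay for fewer one-off false alarms.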
Why a digital twin instead of just dashboards. Time-series charts tell you a value’s history; they do not tell you that Press-07 feeds Cell-B which feeds Line-3, so an anomaly on one machine is about to starve a downstream cell. Azure Digital Twins models those relationships in a queryable graph (DTDL), so applications reason about the plant, not isolated tags — “show every asset upstream of the stoppage,” “roll Line-3 health up from its cells.” The twin is also the clean integration surface: ServiceNow, dashboards, and what-if simulations all read one model instead of re-deriving topology.
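To make the graph argument concrete, here is an in-memory illustration of the two questions quoted above. The asset names and `feeds` edges are hypothetical, and in Azure Digital Twins the traversal would be a query against the twin graph, not a Python dictionary; this only shows the reasoning a flat list of tags cannot support.

```python
from collections import deque

# Hypothetical 'feeds' relationships: source asset -> assets it feeds.
FEEDS = {
    "Press-07": ["Cell-B"],
    "Press-08": ["Cell-B"],
    "Robot-12": ["Cell-C"],
    "Cell-B": ["Line-3"],
    "Cell-C": ["Line-3"],
}


def invert(edges):
    """Turn 'A feeds B' edges into 'B is fed by A' edges."""
    fed_by = {}
    for source, targets in edges.items():
        for target in targets:
            fed_by.setdefault(target, []).append(source)
    return fed_by


def reachable(start, edges):
    """Breadth-first set of every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        for neighbour in edges.get(queue.popleft(), []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def upstream_of(asset, edges=FEEDS):
    """Every asset that directly or transitively feeds the given asset."""
    return reachable(asset, invert(edges))


def downstream_of(asset, edges=FEEDS):
    """Every asset that would be starved if the given asset stopped."""
    return reachable(asset, edges)


print(sorted(upstream_of("Line-3")))
print(sorted(downstream_of("Press-07")))
```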
Why X.509 per-device identity and provisioning, not a shared key. A factory fleet is dozens to thousands of gateways; a shared connection string is a single stolen secret away from a fleet compromise, and rotating it is a nightmare. Each gateway gets its own X.509 certificate, enrolled through the Device Provisioning Service, so identity is per-device, revocable individually, and never a shared secret on disk. The enrolment material the gateway needs at first boot is leased from HashiCorp Vault, short-lived, rather than baked into an image.
Implementation guidance
Provision with Terraform, harden the gateway with Ansible, and treat the OT/IT boundary as the first deliverable. The deployment order matters: the network and identity story has to be right before a single device connects.
- The cloud spine in Terraform — IoT Hub, Device Provisioning Service, Event Grid topics, Azure Digital Twins instance, Azure Data Explorer cluster, the lake, and Container Registry.
- DPS enrolment groups keyed to your X.509 CA, so a new gateway auto-provisions to the right Hub with its own identity on first boot.
- The edge gateway image baked by Ansible — a minimal hardened Linux host, the IoT Edge runtime, the Falcon sensor, and host firewall rules that allow outbound 8883/443 to Azure and nothing inbound from the WAN.
- The module images (OPC Publisher, featurizer, ONNX inference) built in CI and pushed to ACR; the deployment manifest under version control.
- Operator and fleet portals behind Akamai for TLS, WAF, and bot mitigation.
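A minimal IoT Edge deployment manifest communicates the intent: pull the right modules, route telemetry and events separately, and keep a store-and-forward buffer. It is abridged for readability; a deployable manifest also carries schemaVersion, runtime, and systemModules sections, plus type, status, and restartPolicy on each module: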
```json
{
  "modulesContent": {
    "$edgeAgent": {
      "properties.desired": {
        "modules": {
          "opcpublisher": { "settings": { "image": "acr.azurecr.io/opc-publisher:2.9" } },
          "featurizer": { "settings": { "image": "acr.azurecr.io/featurizer:1.4" } },
          "inference": { "settings": { "image": "acr.azurecr.io/rul-onnx:2026.05" } }
        }
      }
    },
    "$edgeHub": {
      "properties.desired": {
        "routes": {
          "telemetryUp": "FROM /messages/modules/featurizer/* INTO $upstream",
          "anomalyUp": "FROM /messages/modules/inference/outputs/anomaly INTO $upstream"
        },
        "storeAndForwardConfiguration": { "timeToLiveSecs": 86400 }
      }
    }
  }
}
```
The timeToLiveSecs: 86400 is the line that lets a gateway ride out a full-day WAN outage and back-fill on reconnect, provided the gateway’s disk is sized to hold a day of buffered messages. Promote a new model by bumping the inference image tag in the manifest and letting the fleet converge, managed via Argo CD doing GitOps on the manifest repo, with GitHub Actions (or Jenkins, where a plant already runs it) building and signing the images. Wiz Code scans the Terraform on every pull request so a misconfigured Hub or a publicly exposed storage account is blocked before merge, not discovered in production.
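The time-to-live semantics are worth seeing in miniature. The class below is a toy model of a TTL-bounded outbound queue, not the edgeHub implementation; the class and method names are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Toy model of a TTL-bounded outbound queue (not the edgeHub implementation)."""

    def __init__(self, ttl_secs):
        self.ttl_secs = ttl_secs
        self._queue = deque()  # (enqueued_at, message), oldest first

    def enqueue(self, message, now):
        self._queue.append((now, message))

    def _expire(self, now):
        # Drop anything that has waited longer than the TTL.
        while self._queue and now - self._queue[0][0] > self.ttl_secs:
            self._queue.popleft()

    def drain(self, now, link_up):
        """Return the messages to send now; nothing is sent while the link is down."""
        self._expire(now)
        if not link_up:
            return []
        sent = [message for _, message in self._queue]
        self._queue.clear()
        return sent

    def depth(self):
        return len(self._queue)


buffer = StoreAndForwardBuffer(ttl_secs=86_400)
buffer.enqueue("window-a", now=0)
buffer.enqueue("window-b", now=10_000)

during_outage = buffer.drain(now=20_000, link_up=False)  # link down: hold everything
after_outage = buffer.drain(now=90_000, link_up=True)    # window-a has aged out
print(during_outage, after_outage)
```

The example shows the edge case that matters: an outage longer than the TTL silently drops the oldest messages, so the TTL and the disk behind it should both be sized to the worst outage you expect.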
Identity: federate the humans, per-device the machines. Engineers and reliability techs reach the Digital Twins explorer, the ADX dashboards, and the fleet console through Okta as the workforce IdP, federated to Microsoft Entra ID so Azure RBAC is native — least-privilege roles on the twin graph and the ADX database, with conditional access. Devices never use those human identities: each gateway authenticates to IoT Hub with its X.509 cert via DPS. The residual secrets that are not managed identities — DPS group enrolment keys, third-party maintenance-API tokens, image-signing keys — live in HashiCorp Vault, leased dynamically, so nothing long-lived sits on a gateway or in a pipeline.
Twin modelling. Define a DTDL model per asset type (press, robot, conveyor, cell, line), with telemetry properties and relationships (feeds, partOf). Update twin state from the telemetry/anomaly stream through an Event Grid–triggered Function, and keep the twin’s IDs aligned with the ServiceNow CMDB so a work order references the same asset the twin and the dashboards do — one identity for a machine across the whole stack.
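As a sketch of what a per-asset model might look like, the snippet below builds three small DTDL v2 interfaces as Python dictionaries and prints one as JSON. The `dtmi:com:example:plant:*` identifiers and the property names (`cmdbId`, `healthScore`, `vibrationRms`) are hypothetical placeholders, not a published ontology.

```python
import json

DTDL_CONTEXT = "dtmi:dtdl:context;2"


def interface(dtmi, display_name, contents):
    """Wrap a contents list in the standard DTDL v2 Interface envelope."""
    return {
        "@context": DTDL_CONTEXT,
        "@id": dtmi,
        "@type": "Interface",
        "displayName": display_name,
        "contents": contents,
    }


# Hypothetical model IDs; replace with your own DTMI namespace.
LINE_ID = "dtmi:com:example:plant:Line;1"
CELL_ID = "dtmi:com:example:plant:Cell;1"
PRESS_ID = "dtmi:com:example:plant:Press;1"

line = interface(LINE_ID, "Line", [
    {"@type": "Property", "name": "healthScore", "schema": "double"},
])

cell = interface(CELL_ID, "Cell", [
    {"@type": "Property", "name": "healthScore", "schema": "double"},
    {"@type": "Relationship", "name": "partOf", "target": LINE_ID},
])

press = interface(PRESS_ID, "Press", [
    # cmdbId carries the ServiceNow CMDB key so work orders and twins agree.
    {"@type": "Property", "name": "cmdbId", "schema": "string"},
    {"@type": "Property", "name": "healthScore", "schema": "double"},
    {"@type": "Telemetry", "name": "vibrationRms", "schema": "double"},
    {"@type": "Relationship", "name": "feeds", "target": CELL_ID},
])

print(json.dumps(press, indent=2))
```

Storing the CMDB key as a twin property is one way to keep a single identity per machine; using the same string as the twin ID itself is another, and either works as long as it is applied consistently.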
Enterprise considerations
Security & Zero Trust, where OT is the hard part. The architecture is Zero Trust by construction at the boundary: outbound-only connectivity, no inbound path to the floor, per-device identity, least-privilege RBAC on the cloud side. Layer on top: (a) CrowdStrike Falcon sensors on every edge gateway for runtime threat detection, USB and process-execution controls, and OT-aware monitoring, because the gateway is the one box bridging plant and cloud and is the prize an attacker wants, and its detections feed the SOC; (b) Wiz running continuous CSPM across IoT Hub, Storage, ADX, and Digital Twins, alerting the moment a resource drifts to public exposure or an over-broad access policy, with Wiz Code catching the same classes of mistake in IaC pre-merge; (c) Azure Policy denying any IoT or storage resource created with public network access, with Wiz as the independent check that the policy actually holds; (d) a Falcon detection or a sustained device-offline anomaly auto-raises a ServiceNow incident so security has a ticket, not just a log line. The OT VLAN itself stays segmented; the gateway is a deliberate, monitored, hardened crossing point, never an accidental one.
Cost optimization. The dominant cost lever is decided at the edge, not in a billing console.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Edge filtering | Featurize/aggregate on the gateway; send signals, not raw waveforms | Cuts cloud ingest & egress by orders of magnitude |
| IoT Hub tier sizing | Match unit tier to message volume, not device count | Avoids over-provisioning the Hub |
| ADX hot/cold cache | Keep recent data hot, age the rest to cold storage | Big saving on long-retention time-series |
| Batched D2C | Batch telemetry messages from the gateway | Fewer billed messages for the same data |
| Right-size gateways | One capable gateway per area vs. one per machine | Fewer managed hosts to license and patch |
Edge filtering is the headline: deciding at the source that a 10 kHz waveform becomes a handful of features per window is the difference between an affordable platform and an unaffordable one. Meter ingest and per-area device cost in Dynatrace / Datadog so the plant owns its spend.
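The batching lever in the table can also be put into numbers. The sketch below assumes that paid IoT Hub tiers meter device-to-cloud messages in 4 KB blocks and that an S1 unit allows 400,000 messages a day; treat both constants, the 200-byte feature record, and the one-record-per-second rate as assumptions to check against current Azure pricing and your own payloads.

```python
import math

METER_CHUNK_BYTES = 4 * 1024            # assumption: D2C messages metered in 4 KB blocks
S1_MESSAGES_PER_UNIT_PER_DAY = 400_000  # assumption: S1 daily quota per unit


def billed_messages_per_day(record_bytes, records_per_day, batch_size):
    """Metered messages when batch_size feature records share one D2C message."""
    sends = math.ceil(records_per_day / batch_size)
    chunks_per_send = math.ceil(record_bytes * batch_size / METER_CHUNK_BYTES)
    return sends * chunks_per_send


def s1_units_needed(billed_per_day):
    return math.ceil(billed_per_day / S1_MESSAGES_PER_UNIT_PER_DAY)


# Hypothetical load: 40 machines, one 200-byte feature record per second each.
RECORDS_PER_DAY = 40 * 86_400
RECORD_BYTES = 200

for batch in (1, 20):
    billed = billed_messages_per_day(RECORD_BYTES, RECORDS_PER_DAY, batch)
    print(f"batch={batch:>2}: {billed:,} billed messages/day, "
          f"{s1_units_needed(billed)} S1 unit(s)")
```

Under these assumptions, unbatched sends need nine S1 units while batches of twenty fit in one. Batching adds delay, so it suits routine telemetry; anomaly events are better sent immediately on their own route.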
Scalability — two axes, plant and fleet. Within a plant, you scale by adding gateways per production area (each owning its machines) and by tuning OPC sampling/publishing intervals so the Hub sees a sane message rate. Across the enterprise, the same manifest-and-DPS pattern onboards a second and third plant with no new design — DPS auto-provisions new gateways, Argo CD converges them to the right module set, and Digital Twins gains a site sub-graph. IoT Hub scales by units and tier; ADX scales out its cluster for query and ingest. The natural ceiling is per-Hub device and throughput limits, which is why a multi-plant rollout plans Hub-per-region (or IoT Central / Hub scale-units) early.
Failure modes, and what each one looks like. Name them before they page you.
- WAN link outage. The site link to Azure drops mid-shift. Without store-and-forward you lose telemetry and cloud visibility; with it, local inference keeps protecting the line and the gateway back-fills history on reconnect — the outage costs latency on the cloud view, not safety on the floor. Mitigation: store-and-forward TTL sized to your worst realistic outage; alert on device-offline in Dynatrace/Datadog.
- Gateway hardware failure. A gateway dies and an area goes dark. Mitigation: gateways are cattle — a spare boots, DPS re-provisions it by certificate, Argo CD restores its modules, and it rejoins in minutes; keep cold spares per plant.
- Model drift / bad model. A retrained model raises false alarms or misses faults. Mitigation: version every model image, canary it to a few gateways before fleet-wide promotion, and keep instant rollback by reverting the manifest tag.
- Event storm. A line trips and a thousand correlated anomalies flood Event Grid and ServiceNow. Mitigation: de-duplicate/correlate at the edge and in the Function, raise one work order per fault with the correlated context, dead-letter the overflow.
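- Certificate expiry. A gateway’s X.509 cert lapses and it silently stops connecting. Mitigation: monitor cert expiry as a first-class metric, automate renewal against the issuing CA well before the expiry date, and let the gateway re-provision through DPS with the renewed certificate.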
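The event-storm mitigation, one work order per fault with correlated context, might look roughly like this inside the Function. The event fields (`line_id`, `asset_id`, `ts`) and the five-minute window are hypothetical; a real implementation would also persist state outside the process.

```python
from dataclasses import dataclass, field


@dataclass
class FaultCorrelator:
    window_secs: float = 300.0  # hypothetical correlation window
    _open: dict = field(default_factory=dict, init=False, repr=False)

    def ingest(self, event):
        """Return a new work-order dict for a fresh fault, or None if the
        event was folded into a work order that is still open for that line."""
        line, ts = event["line_id"], event["ts"]
        current = self._open.get(line)
        if current is not None and ts - current["last_seen"] <= self.window_secs:
            # Same fault: extend the window and enrich the existing context.
            current["last_seen"] = ts
            current["assets"].add(event["asset_id"])
            current["count"] += 1
            return None
        order = {"line_id": line, "first_seen": ts, "last_seen": ts,
                 "assets": {event["asset_id"]}, "count": 1}
        self._open[line] = order
        return order


correlator = FaultCorrelator()
storm = [{"line_id": "Line-3", "asset_id": f"Press-{i % 5:02d}", "ts": i * 0.01}
         for i in range(1000)]
orders = [order for event in storm
          if (order := correlator.ingest(event)) is not None]
print(len(orders), orders[0]["count"], sorted(orders[0]["assets"]))
```

A thousand correlated anomalies collapse into a single work order that still records which assets were involved and how many events arrived, which is what the technician actually needs.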
Reliability & DR (RTO/RPO). Decide the numbers per tier, and note the floor’s are the lenient ones — the line keeps running offline by design. For the cloud: ADX with follower databases or a paired-region cluster protects analytics; the lake (geo-redundant) is the durable source of truth for retraining; Digital Twins is rebuildable from the asset model in source control plus a state replay. A pragmatic target: cloud-pipeline RTO 30 minutes, RPO near-zero for telemetry (store-and-forward back-fills the gap), with the twin and dashboards reconstructable from IaC and the lake. The plant floor’s effective RTO for inference is zero — that is the entire point of edge autonomy. Akamai health checks drive failover for the management portals.
Observability. Instrument both halves. In the cloud, trace the pipeline — Hub → Event Grid → Function → Twin/ServiceNow — in Dynatrace / Datadog, with anomaly detection on ingest rate and event latency. For the edge fleet, the metrics that matter are operational: per-gateway module health, messages buffered (store-and-forward depth — a rising buffer means the link is struggling), time since last cloud contact, inference latency, and device-offline count. A gateway that goes quiet is the single most important alert in a connected factory; wire it to page. Emit OEE (availability × performance × quality) from ADX as the business metric the plant director watches.
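For reference, the OEE product works out as follows. In the platform this would live in a KQL function over ADX; the Python below is only a worked example of the arithmetic, and the shift figures are invented for illustration.

```python
def oee(planned_min, downtime_min, ideal_cycle_s, total_count, good_count):
    """Overall Equipment Effectiveness = availability x performance x quality."""
    run_min = planned_min - downtime_min
    if planned_min <= 0 or run_min <= 0 or total_count <= 0:
        raise ValueError("need positive planned time, run time, and part count")
    availability = run_min / planned_min
    performance = (ideal_cycle_s * total_count) / (run_min * 60)
    quality = good_count / total_count
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Invented shift: 500 planned minutes, 100 minutes down, 4 s ideal cycle,
# 5,400 parts stamped, 5,130 of them good.
shift = oee(planned_min=500, downtime_min=100, ideal_cycle_s=4,
            total_count=5400, good_count=5130)
print({name: round(value, 3) for name, value in shift.items()})
```

Here availability is 0.80, performance 0.90, and quality 0.95, giving an OEE of 0.684. Reporting the three factors alongside the product tells the plant director whether a bad shift was about stoppages, slow cycles, or scrap.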
Governance & change. Treat an edge deployment like a production release: manifests in version control, model versions pinned (never a floating latest), promotion through canary and rollback. New gateway rollouts and model promotions pass a ServiceNow change gate so the plant has a documented record — this is OT, where an unreviewed change can stop a line. And because the tooling is new to the floor, run operators and maintenance technicians through Moodle courses (SSO via Okta) on the dashboards, the anomaly workflow, and the runbooks, gated before they get dashboard access — a trained operator who trusts the alert is what turns a prediction into a prevented stoppage.
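The "never a floating latest" rule is cheap to enforce in CI. The check below is a minimal sketch that walks a manifest shaped like the one earlier in this article and reports modules whose image is untagged or tagged `latest`; the function name and the sample manifest are hypothetical.

```python
def unpinned_images(manifest):
    """Return the names of modules whose image has no tag or uses 'latest'."""
    modules = (manifest["modulesContent"]["$edgeAgent"]
               ["properties.desired"]["modules"])
    offenders = []
    for name, spec in modules.items():
        image = spec["settings"]["image"]
        if "@sha256:" in image:
            continue  # pinned by digest, which is stricter than a tag
        # Only look after the last '/', so a registry port is not read as a tag.
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if tag in ("", "latest"):
            offenders.append(name)
    return sorted(offenders)


candidate = {
    "modulesContent": {
        "$edgeAgent": {"properties.desired": {"modules": {
            "opcpublisher": {"settings": {"image": "acr.azurecr.io/opc-publisher:2.9"}},
            "featurizer": {"settings": {"image": "acr.azurecr.io/featurizer:latest"}},
            "inference": {"settings": {"image": "localhost:5000/rul-onnx"}},
        }}}
    }
}

print(unpinned_images(candidate))
```

Failing the pull request when this list is non-empty keeps rollback meaningful: reverting the manifest only restores the previous model if the previous tag still points at the previous image.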
Explicit tradeoffs
Accept these or do not build it. Edge-to-cloud adds real operational surface: you now run a fleet of edge devices — to patch, monitor, secure, and certificate-manage — on top of the cloud platform, and that fleet lives in a harsh, sometimes-disconnected, security-sensitive environment. The split-brain design (local inference, cloud truth) means you maintain a model in two places — trained in the cloud, executed at the edge — and need a disciplined promotion pipeline so they stay in step. The OT/IT boundary that makes the CISO sign costs you the convenience of ever reaching into the floor: no inbound debugging shortcut, everything device-initiated. And the digital twin is genuine modelling work — DTDL per asset, kept in sync with reality and with the CMDB — that a pile of dashboards would let you skip.
The alternatives, and when they win. If you have a handful of machines at a single site with a reliable link and no hard offline requirement, streaming straight to IoT Hub and inferring in the cloud is simpler — skip the edge ML entirely. If you want a lower-code, opinionated path and can accept its guardrails, Azure IoT Central gives you a managed application platform over the same Hub primitives and stands up a fleet faster. If your need is pure historian-style analytics with no real-time control, a classic OPC historian plus a BI tool may be all you require. And if you are early and exploring, start with one gateway, one line, and the cloud-inference path, then graduate to on-prem inference, the digital twin, and the full fleet pattern when uptime, autonomy, data gravity, or multi-plant scale demand it.
The shape of the win
For the supplier’s plant, the payoff is not “a dashboard.” It is that the hydraulic press’s vibration signature drifts on a Tuesday, the on-floor model flags a rising remaining-useful-life risk that shift, Event Grid opens a ServiceNow work order automatically, a technician — trained on the workflow in Moodle and trusting the alert — swaps the bearing during the scheduled Thursday changeover, and the line that used to stop eleven times a quarter for ₹18 lakh an hour simply does not stop. And it happened on a gateway that the CISO is comfortable with, because that gateway is Falcon-protected, identity is per-device through DPS, secrets are Vault-leased, the boundary is outbound-only, and Wiz is independently watching the cloud side hold. Everything upstream — the edge inference, the store-and-forward buffer, the digital twin, the X.509 fleet identity, the CrowdStrike sensor, the Dynatrace fleet view — exists to make a plant director and a CISO each say yes to the same platform. Start with one line if you must; this is where a connected, secure, predictive factory has to land.