Industrial and connected-product companies sit on a paradox: their machines emit more telemetry than ever, yet the data lands in a dozen disconnected silos — a SCADA historian here, a CSV export there, a vendor cloud nobody can query. The result is that the most operationally valuable signal in the business (what the physical assets are actually doing, second by second) is the hardest to analyze. This article lays out a reusable AWS architecture that turns raw device telemetry into governed, queryable, dashboard-ready analytics, scaling from a single pilot line to a global fleet of millions of devices.
The business scenario
Picture three companies that look different but share the same shape of problem.
A mid-size HVAC manufacturer ships 40,000 rooftop units a year, each with a cellular modem. Warranty claims are killing margins, and the only way to know a compressor is failing is when a customer calls. They want predictive maintenance, but the telemetry never leaves the unit’s flash memory.
A regional water utility runs 1,200 pumping stations with PLCs speaking Modbus and OPC-UA. Their historian retains 90 days at full resolution, then downsamples and effectively forgets. Regulators now demand multi-year auditable records of flow, pressure, and chlorine dosing, and the operations team wants a single pane of glass instead of 1,200 vendor HMIs.
A logistics operator has 8,000 refrigerated trailers (reefers) with GPS, temperature, door, and fuel sensors reporting every 30 seconds. Cold-chain compliance fines are real money, and a single spoiled load can cost more than a year of cloud bills.
Every one of these is the same architecture under the hood: high-cardinality device telemetry, ingested securely at the edge, processed in motion, stored as time-series, and surfaced to both engineers and business users. What differs is scale (thousands vs. millions of devices), protocol (MQTT vs. industrial fieldbus), and latency tolerance (a reefer alarm needs seconds; a warranty trend can wait minutes).
The non-negotiable requirements that recur across all three:
- Secure per-device identity — no shared secrets, revocable in seconds, surviving a stolen device.
- Lossless ingestion that absorbs reconnection storms (a cell tower flaps and 10,000 devices reconnect at once).
- Hot path and warm path — real-time alerting in seconds, plus historical analytics over months or years.
- Asset context: raw temp=4.2 is useless; Trailer 7731 / Reefer Unit / Evaporator Coil = 4.2°C is actionable.
- Self-service dashboards for non-engineers, without exporting data to laptops.
- Predictable cost that scales sub-linearly with device count.
This is the canonical AWS IoT analytics stack: IoT Core → Kinesis → Timestream / SiteWise → QuickSight, with S3 as the durable backbone.
Architecture overview
The end-to-end data path moves left to right, from physical asset to business insight, with a clean split between a low-latency “hot path” and a high-throughput “warm/cold path.”
1. Edge & connectivity. Devices connect to AWS IoT Core over MQTT (TLS 1.2 or 1.3, mutual auth with X.509 certificates) or, where networks only permit outbound HTTPS on port 443, MQTT over WebSockets. Industrial sites that speak Modbus/OPC-UA don’t talk MQTT natively, so a gateway running AWS IoT Greengrass sits on-prem, normalizes fieldbus protocols, buffers during WAN outages, and forwards to the cloud. SiteWise can also ingest directly from the SiteWise Edge gateway at the plant.
2. Ingestion & routing. IoT Core’s message broker receives every publish. The IoT Rules Engine is the fan-out hub: a single inbound message can be routed by SQL-like rules to multiple destinations simultaneously. One rule pushes the full firehose to Amazon Kinesis Data Streams for stream processing; another sends a filtered subset (e.g., only alarm=true messages) to AWS IoT Events or directly to an SNS topic; a third writes structured measurements into AWS IoT SiteWise asset properties.
3. Hot path (seconds). Kinesis Data Streams feeds a stream processor: either Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) for windowed aggregations and anomaly detection, or a Lambda consumer for simpler enrichment. Anomalies and threshold breaches fan out to SNS / IoT Events for alerting and to DynamoDB for the “current device state” lookup that operators query. Latency from sensor to alert is typically single-digit seconds.
4. Warm path / time-series store. Cleaned, enriched records land in Amazon Timestream — a purpose-built, serverless time-series database with a memory tier (fast, recent) and a magnetic tier (cheap, historical), and automatic tiering between them. In parallel, IoT SiteWise models the data against an asset hierarchy (Site → Line → Machine → Sensor), computes derived metrics (“transforms” and “metrics” like rolling OEE or average pressure), and serves both engineers and dashboards.
5. Cold path / data lake. Kinesis Data Firehose continuously batches raw and processed telemetry into Amazon S3 in Parquet, partitioned by date and device group. This is the immutable system of record. AWS Glue catalogs it; Athena queries it ad hoc; the lake feeds ML training (SageMaker) and long-term compliance retention via S3 lifecycle policies into Glacier.
6. Visualization & consumption. Amazon QuickSight is the business-facing layer. It connects natively to Timestream and Athena, uses SPICE (its in-memory engine) for snappy dashboards, and supports embedded analytics so the utility’s operators or the manufacturer’s dealers see dashboards inside their own portals. SiteWise also offers SiteWise Monitor portals for plant engineers who think in assets, not SQL.
In one sentence: IoT Core authenticates and ingests, the Rules Engine fans data into Kinesis (hot) and SiteWise/Firehose (warm/cold), Timestream and SiteWise store and model it, and QuickSight + SiteWise Monitor present it — with S3 as the durable lake underneath everything.
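The fan-out in step 2 can be pictured as a small routing table. The sketch below is a plain-Python model, not the Rules Engine itself: the destination labels and the payload fields (`alarm`, `measurements`) are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical rule table: (destination label, predicate over the decoded payload).
# The labels are illustrative only; they are not AWS resource identifiers.
Rule = Tuple[str, Callable[[dict], bool]]

RULES: List[Rule] = [
    ("kinesis_hot_path", lambda msg: True),              # full stream
    ("alerting", lambda msg: msg.get("alarm") is True),  # filtered subset
    ("sitewise", lambda msg: "measurements" in msg),     # structured values
]


def route(message: dict, rules: List[Rule] = RULES) -> List[str]:
    """Return every destination whose predicate matches the message."""
    return [dest for dest, matches in rules if matches(message)]
```

A single alarm message with measurements would match all three entries, while a bare reading would match only the full-stream entry. The point of the model is that routing is declarative: adding a destination means adding a rule, not changing device firmware.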
Component breakdown
| Component | Role | Why it’s here | Key configuration choices |
|---|---|---|---|
| AWS IoT Core | Managed MQTT broker + device registry + Rules Engine | Secure, scalable front door for millions of devices with per-device identity | X.509 cert per device via fleet provisioning; Thing Groups for fleet management; basic ingest topics to skip broker billing on high-volume rule traffic |
| AWS IoT Greengrass | Edge runtime on on-prem gateways | Protocol translation (Modbus/OPC-UA→MQTT), local buffering, edge ML inference during WAN loss | Stream Manager for store-and-forward; component-based deployment; local Lambda/container compute |
| IoT Rules Engine | SQL routing of inbound messages | Single message → many destinations without custom code | SQL SELECT with functions; error action to SQS/CloudWatch; route to Kinesis, SiteWise, Firehose, Lambda in parallel |
| Kinesis Data Streams | Ordered, replayable telemetry pipe | Decouples ingestion from processing; absorbs spikes; multiple consumers | On-demand mode for unpredictable IoT load, or provisioned shards (1 MB/s or 1,000 records/s each); partition key = deviceId; 24h–365d retention |
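| Managed Service for Apache Flink | Stateful stream processing | Windowed aggregations, sessionization, real-time anomaly detection (e.g., a Random Cut Forest library; the built-in RANDOM_CUT_FOREST function belongs to the legacy Kinesis Data Analytics for SQL runtime, not Flink) | Tumbling/sliding windows; checkpointing to S3; autoscaling parallelism |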
| Amazon Timestream | Serverless time-series database | Purpose-built for trillions of events/day with automatic hot→cold tiering | Memory store retention (hours/days) + magnetic store (years); scheduled queries for rollups; partition by dimension |
| AWS IoT SiteWise | Industrial asset modeling + time-series | Gives raw signals asset context and computes OEE/derived metrics; engineer-friendly | Asset models + hierarchies; transforms & metrics; SiteWise Edge for on-prem; data buffering |
| Kinesis Data Firehose | Batched delivery to the lake | Zero-code, serverless landing of data into S3/Redshift | Parquet conversion via Glue schema; dynamic partitioning by deviceGroup/date; buffering 64–128 MB |
| Amazon S3 | Durable data lake / system of record | Immutable raw store, ML source, compliance archive | Partitioned Parquet; lifecycle to Glacier; Object Lock for WORM compliance |
| Amazon DynamoDB | “Current state” / device shadow store | Single-digit-ms lookup of latest reading per device for operator screens | deviceId PK; TTL on stale entries; on-demand capacity |
| Amazon QuickSight | BI & embedded dashboards | Self-service analytics for business users; embeddable in portals | SPICE for speed, direct query for freshness; row-level security per tenant; embedding via the GenerateEmbedUrlForRegisteredUser API behind your app’s own sign-in |
| SiteWise Monitor | Asset-centric operational portals | Plant engineers visualize hierarchies without BI tooling | Portals, projects, dashboards scoped by asset; SSO via IAM Identity Center |
A note on why two time-series destinations exist. Timestream is a general-purpose time-series database you query with SQL — ideal for cross-device analytics, ML feature engineering, and QuickSight. SiteWise is opinionated around the industrial asset model: it natively understands a hierarchy of equipment and computes engineering metrics. Many enterprises run both — SiteWise for the OT/engineering audience and the asset graph, Timestream (and the S3 lake) for the data-science and BI audience. Smaller deployments pick one. We call this out explicitly so the choice is deliberate, not accidental.
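What "asset context" buys you can be shown with a tiny lookup. This is a minimal sketch, assuming a hard-coded index keyed by a hypothetical device ID (`reefer-7731`) and measurement name; in a real deployment the mapping would come from the asset model rather than a dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRef:
    """Position of one sensor in a hypothetical Site > Unit > Component tree."""
    site: str
    unit: str
    component: str
    unit_of_measure: str


# Hypothetical index keyed by (deviceId, measurement name).
ASSET_INDEX = {
    ("reefer-7731", "temp"): AssetRef("Trailer 7731", "Reefer Unit", "Evaporator Coil", "°C"),
}


def with_asset_context(device_id: str, name: str, value: float) -> str:
    """Render a raw reading as an asset-qualified, human-readable string."""
    ref = ASSET_INDEX.get((device_id, name))
    if ref is None:
        return f"{device_id}/{name} = {value} (unmapped)"
    return f"{ref.site} / {ref.unit} / {ref.component} = {value}{ref.unit_of_measure}"
```

Unmapped signals are returned with an explicit marker rather than dropped, so gaps in the asset model show up in the data instead of disappearing silently.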
Implementation guidance
Device identity and provisioning. Never ship devices with a shared certificate. Use IoT Core fleet provisioning: each device presents a bootstrap claim certificate, calls the provisioning template, and receives a unique X.509 certificate bound to a Thing. Attach an IoT policy scoped with policy variables so a device can only publish/subscribe to its own topic namespace:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["iot:Publish"],
      "Resource": "arn:aws:iot:*:*:topic/dt/*/*/${iot:Connection.Thing.ThingName}/telemetry"
    }
  ]
}
This is the cornerstone of Zero Trust here — a compromised device cannot impersonate or eavesdrop on any other.
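To see why a thing-name policy variable isolates devices, here is a simplified model of the check. It is only a sketch: it resolves the variable and treats `*` as a wildcard, and it does not reproduce IoT Core's real policy evaluator. The function name and the sample thing name are hypothetical.

```python
import re

THING_NAME_VAR = "${iot:Connection.Thing.ThingName}"


def allows_publish(resource_topic: str, thing_name: str, topic: str) -> bool:
    """Simplified check: resolve the thing-name variable, treat '*' as a wildcard."""
    resolved = resource_topic.replace(THING_NAME_VAR, thing_name)
    # Escape everything, then turn the escaped '*' back into a regex wildcard.
    pattern = re.escape(resolved).replace(r"\*", ".*")
    return re.fullmatch(pattern, topic) is not None
```

Because the variable is resolved from the authenticated connection, a device that connects as `reefer-7731` matches only topics containing its own name, whatever it tries to put in the topic string.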
Topic design. Adopt a hierarchical topic taxonomy, e.g. dt/<site>/<assetType>/<deviceId>/telemetry for data and cmd/<deviceId>/# for commands. Use basic ingest ($aws/rules/<ruleName>/...) for high-volume telemetry so you pay Rules Engine and downstream costs but skip per-message broker messaging charges.
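A hierarchical taxonomy pays off when downstream code can recover fields from the topic alone. The sketch below parses the telemetry topic shape described above and mimics the Rules Engine's `topic(n)` function, which counts segments from 1, so the device ID in this layout is segment 4. The helper names and sample values are hypothetical.

```python
from typing import NamedTuple


class TelemetryTopic(NamedTuple):
    site: str
    asset_type: str
    device_id: str


def topic_segment(topic: str, n: int) -> str:
    """Mimic the Rules Engine topic(n) function, which counts segments from 1."""
    return topic.split("/")[n - 1]


def parse_telemetry_topic(topic: str) -> TelemetryTopic:
    """Parse dt/<site>/<assetType>/<deviceId>/telemetry or raise ValueError."""
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "dt" or parts[4] != "telemetry":
        raise ValueError(f"not a telemetry topic: {topic!r}")
    return TelemetryTopic(site=parts[1], asset_type=parts[2], device_id=parts[3])
```

Rejecting malformed topics explicitly keeps a mis-provisioned device from polluting partitions keyed on the parsed fields.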
IaC with Terraform. Manage everything as code. A representative module layout:
# IoT Core ingestion rule -> Kinesis (hot) + Firehose (cold).
# The SiteWise rule action is omitted here for brevity.
resource "aws_iot_topic_rule" "telemetry_fanout" {
  name        = "telemetry_fanout"
  enabled     = true
  # topic(n) is 1-indexed: dt(1)/site(2)/assetType(3)/deviceId(4)/telemetry(5)
  sql         = "SELECT *, topic(4) AS deviceId FROM 'dt/+/+/+/telemetry'"
  sql_version = "2016-03-23"
  kinesis {
    role_arn    = aws_iam_role.iot_to_kinesis.arn
    stream_name = aws_kinesis_stream.telemetry.name
    # Substitution templates cannot see SELECT aliases, so call topic(4) directly.
    # "$$" escapes Terraform interpolation and passes a literal ${...} to IoT.
    partition_key = "$${topic(4)}"
  }
  firehose {
    role_arn             = aws_iam_role.iot_to_firehose.arn
    delivery_stream_name = aws_kinesis_firehose_delivery_stream.lake.name
    separator            = "\n"
  }
  # Dead-letter queue for messages whose rule actions fail
  error_action {
    sqs {
      role_arn   = aws_iam_role.iot_dlq.arn
      queue_url  = aws_sqs_queue.rule_dlq.id
      use_base64 = false
    }
  }
}
# On-demand capacity mode: no shard planning while load is unpredictable
resource "aws_kinesis_stream" "telemetry" {
  name             = "telemetry"
  retention_period = 24 # hours
  stream_mode_details {
    stream_mode = "ON_DEMAND"
  }
}
resource "aws_timestreamwrite_database" "iot" {
  database_name = "iot_analytics"
}
resource "aws_timestreamwrite_table" "telemetry" {
  database_name = aws_timestreamwrite_database.iot.database_name
  table_name    = "device_telemetry"
  retention_properties {
    memory_store_retention_period_in_hours  = 24
    magnetic_store_retention_period_in_days = 1825 # 5 years
  }
}
Attach an error action to every rule: failed rule actions are easy to miss unless you route them to a dead-letter queue or enable IoT logging to CloudWatch. For SiteWise, model asset hierarchies as code (at the time of writing these resources are exposed as awscc_iotsitewise_asset_model and awscc_iotsitewise_asset in the AWS Cloud Control provider rather than the standard AWS provider, so check your provider version), then wire the SiteWise rule action to map MQTT payload fields to asset property IDs. (Bicep and Deployment Manager target Azure and Google Cloud, so they don’t apply here; the Terraform AWS provider is the standard choice, with the CDK as an alternative for teams that prefer TypeScript/Python.)
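Stream processing. For the Flink job, checkpoint to S3, key streams by deviceId, and use tumbling windows for periodic rollups (1-min averages) plus a Random Cut Forest implementation for unsupervised anomaly scores. The built-in RANDOM_CUT_FOREST function is a feature of the legacy Kinesis Data Analytics for SQL runtime, so in Flink the algorithm comes from a library you package with the job. Emit anomalies to a second Kinesis stream or directly to IoT Events for the alarm state machine.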
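The windowing logic is easier to reason about in miniature. The sketch below is plain Python rather than Flink code: it computes per-device tumbling-window averages and then flags outliers with a simple z-score, which stands in here for Random Cut Forest only to keep the example dependency-free. The function names, the 60-second window, and the threshold of 3.0 are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW_SECONDS = 60


def tumbling_averages(readings):
    """Average (device_id, epoch_seconds, value) tuples per device per window.

    Returns {(device_id, window_start): average}.
    """
    buckets = defaultdict(list)
    for device_id, ts, value in readings:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to the window boundary
        buckets[(device_id, window_start)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def zscore_anomalies(series, threshold=3.0):
    """Return indexes whose value lies more than `threshold` std devs from the mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]
```

Keying the buckets by device ID mirrors keying the stream by deviceId: each device's windows are independent, so the work parallelizes cleanly across the fleet.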
Networking and identity wiring. Keep processing private: run Flink/Lambda consumers in a VPC and reach AWS services over VPC interface endpoints (PrivateLink) for Kinesis, Timestream, and S3 (gateway endpoint) so analytics traffic never traverses the public internet. IoT Core’s data plane is a public TLS endpoint by design, but you can use IoT Core VPC endpoints for device traffic from within your network or via Direct Connect for fixed industrial sites. Use IAM Identity Center for human SSO into QuickSight and SiteWise Monitor; use IAM roles (never long-lived keys) for every service-to-service hop. QuickSight embedding uses the GenerateEmbedUrlForRegisteredUser API behind your app’s auth.
Enterprise considerations
Security & Zero Trust. Identity is per-device, per-user, and per-service — never shared. Device certs are revocable instantly via the registry and continuously audited by AWS IoT Device Defender, which detects anomalous behavior (a device suddenly publishing 100x its baseline, or to topics outside its policy) and can quarantine it. Encrypt everywhere: TLS in transit; KMS-managed keys at rest in Kinesis, Timestream, S3, and DynamoDB. Scope IoT policies with policy variables (above). For the data lake, Lake Formation enforces column- and row-level permissions so the BI team sees aggregates but not raw GPS traces if policy forbids. QuickSight row-level security ensures a dealer sees only their own units in an embedded dashboard.
Cost optimization. IoT cost scales with message count and size, so the biggest lever is edge aggregation — have Greengrass or the device batch and pre-aggregate rather than firehosing every raw reading. Use basic ingest to drop broker messaging charges on telemetry that only feeds rules. Pick Kinesis on-demand for spiky fleets but switch to provisioned shards once load is predictable (often 40–60% cheaper at steady state). Lean on Timestream’s tiering: keep memory-store retention short (hours/days) and let data fall to the cheap magnetic tier; use scheduled queries to pre-aggregate so dashboards hit small rollup tables, not raw events. Land the lake in Parquet (5–10x cheaper to scan in Athena than JSON) and lifecycle cold partitions to Glacier. In QuickSight, SPICE caching avoids re-querying Timestream on every dashboard view, and the per-reader pricing keeps embedded analytics economical at scale.
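The effect of edge batching on message volume can be estimated with a few lines of arithmetic. This sketch assumes IoT Core's metering of messages in 5 KB increments, interprets 5 KB as 5,120 bytes, and uses a hypothetical 200-byte reading; substitute your own payload size and current pricing terms.

```python
import math

# Assumption: messages are metered in 5 KB increments, taken here as 5 * 1024 bytes.
METERING_INCREMENT_BYTES = 5 * 1024


def metered_messages_per_day(devices, readings_per_day, reading_bytes, batch_size=1):
    """Metered messages per day when each publish carries `batch_size` readings."""
    publishes_per_device = math.ceil(readings_per_day / batch_size)
    increments_per_publish = math.ceil(reading_bytes * batch_size / METERING_INCREMENT_BYTES)
    return devices * publishes_per_device * increments_per_publish
```

For 8,000 devices producing 2,880 readings a day each, bundling ten small readings per publish cuts the metered count by a factor of ten, as long as the bundle still fits inside one metering increment.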
Scalability. Every tier is horizontally elastic: IoT Core scales to millions of connections; Kinesis to thousands of shards (or on-demand auto-scale); Timestream and Firehose are serverless. Partition keys (deviceId) keep ordering per-device while spreading load. The architecture’s scaling is sub-linear in cost because of aggregation, tiering, and Parquet — doubling device count does not double the bill.
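Why a deviceId partition key both preserves per-device order and spreads load follows from how Kinesis places records: the partition key is MD5-hashed into a 128-bit key space and the record goes to the shard that owns that hash value. The sketch below assumes the shards split the key space evenly, which is a simplification; real hash ranges depend on how the stream has been resharded.

```python
import hashlib

HASH_SPACE = 2 ** 128  # partition keys are hashed into a 128-bit key space


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard that owns this key, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE
```

The same key always lands on the same shard, which gives per-device ordering, while distinct device IDs scatter roughly uniformly. A low-cardinality key such as region would instead concentrate traffic on a few shards.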
Reliability & DR (RTO/RPO). Kinesis replication is multi-AZ by default; set stream retention to 24h–7d so a downstream outage is replayable (this gives a near-zero RPO for the hot path — you can reprocess). The S3 lake is the durable source of truth (eleven nines); enable Cross-Region Replication for the buckets and Glacier archives if you need regional DR. Timestream is multi-AZ within a region; for cross-region resilience, the S3 lake is your rebuild source. A practical target: RPO ≈ minutes (bounded by Firehose buffer + Kinesis retention) and RTO of a few hours to stand up the processing/serving tier in a second region from IaC, with the lake replicated. IoT Core devices should be configured with exponential-backoff reconnection and offline buffering (Greengrass Stream Manager / device-side queue) so a regional blip doesn’t lose data.
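Exponential backoff matters most during reconnection storms, and adding jitter keeps thousands of devices from retrying in lockstep. The following is a minimal sketch of the "full jitter" variant; the base of 1 second and the 300-second cap are illustrative assumptions, not recommended values for any particular device SDK.

```python
import random


def reconnect_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

Passing a seeded `random.Random` makes the schedule reproducible in tests, while devices in the field would use the default unseeded generator so their retries decorrelate.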
Observability. Watch CloudWatch metrics on IoT Core (Connect.Success, RulesExecuted, RuleMessageThrottled), Kinesis (GetRecords.IteratorAgeMilliseconds, reported as IteratorAge for Lambda consumers, which is the canary for consumer lag), Firehose delivery success, and Timestream write/query latency. Alarm on iterator age rising (processing falling behind) and on IoT rule error-action DLQ depth. Device Defender feeds a security dashboard. Trace the pipeline end-to-end and surface the operational view in CloudWatch dashboards, while business KPIs live in QuickSight.
Governance. Tag every resource by cost center and environment. Use AWS Organizations + SCPs to enforce encryption and region restrictions. Catalog the lake in Glue Data Catalog and govern access with Lake Formation. Retain raw telemetry per regulatory mandate using S3 Object Lock (WORM) for tamper-proof compliance records — critical for the utility and cold-chain cases.
Reference enterprise example
ColdHaul Logistics operates 8,000 refrigerated trailers across North America. Each reefer reports GPS, setpoint vs. actual temperature, ambient temperature, door open/close events, and fuel level every 30 seconds, which is roughly 23 million readings per day, with reconnection storms of up to 8,000 simultaneous connects whenever trailers exit a dead zone. A single spoiled produce load costs the customer ~$45,000 and triggers a chargeback; cold-chain regulations (FSMA, FDA) require 2 years of auditable temperature records.
Their build:
- IoT Core with fleet provisioning — 8,000 unique certs, scoped policies, Device Defender watching for compromised modems. Devices use basic ingest to feed rules directly.
- Rules Engine fans each message three ways: (1) full stream → Kinesis on-demand; (2) door_open && moving or temp > setpoint + 3°C for 5 min → IoT Events alarm state machine → SNS to dispatcher + DynamoDB current-state; (3) raw → Firehose → S3 Parquet, partitioned by region/date.
- Managed Service for Apache Flink computes per-trailer 5-minute temperature averages and runs a Random Cut Forest (RCF) model to catch failing compressors before the load is lost, which is the highest-value signal in the system.
- Timestream stores enriched readings: 48-hour memory tier for live ops, 2-year magnetic tier for compliance. Scheduled queries roll up hourly min/max/avg per trailer.
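- QuickSight powers the ops control tower (live map, trailers-at-risk leaderboard fed by Timestream plus DynamoDB current state, the latter reached through Athena federated query because QuickSight has no native DynamoDB connector) and an embedded customer portal where each shipper sees only their loads via row-level security.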
- S3 + Object Lock holds the 2-year WORM compliance archive; lifecycle moves >90-day data to Glacier.
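The "temp > setpoint + 3°C for 5 min" condition in the build above is stateful, which is why it belongs in an alarm state machine rather than a stateless routing rule. The sketch below models that logic in plain Python; it is not an IoT Events detector model, and the function name and sample format are hypothetical.

```python
def sustained_excursion(samples, setpoint, margin=3.0, hold_seconds=300):
    """Return the timestamp at which the alarm should fire, or None.

    `samples` is an iterable of (epoch_seconds, temperature) in time order.
    The alarm fires once temperature has stayed above setpoint + margin for
    at least `hold_seconds` without dipping back under the limit.
    """
    breach_started = None
    for ts, temp in samples:
        if temp > setpoint + margin:
            if breach_started is None:
                breach_started = ts  # first sample of a new excursion
            if ts - breach_started >= hold_seconds:
                return ts
        else:
            breach_started = None  # excursion ended, reset the timer
    return None
```

Resetting the timer on any in-range sample is what filters out brief door-open spikes, at the cost of delaying the alarm by the full hold period for a genuine failure.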
Numbers and outcome. At ~23M readings/day, which edge batching (each device sends a 10-reading bundle every 5 minutes rather than each reading raw) reduces to ~2.3M publishes/day, monthly costs land near IoT Core ingest ~$600, Kinesis on-demand ~$700, Flink ~$900, Timestream ~$1,400, Firehose+S3+Glacier ~$500, QuickSight ~$1,200, totaling roughly $5,300/month: under $0.70 per trailer per month to monitor a $150,000 asset hauling perishable freight. In the first year, early compressor-failure detection prevented an estimated 31 spoiled loads (~$1.4M avoided), cold-chain chargebacks fell 78%, and the compliance team replaced a quarterly fire drill of CSV exports with a one-click audit export from the WORM archive. The data scientists later trained a remaining-useful-life model on the S3 lake with zero new pipeline work, because the lake was already there.
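The figures in this example can be cross-checked with simple arithmetic. The script below only recomputes numbers already stated in the text (fleet size, reporting interval, bundle size, line-item costs, loads avoided); the variable names are illustrative.

```python
TRAILERS = 8_000
REPORT_INTERVAL_S = 30
READINGS_PER_BUNDLE = 10

# Readings generated per day, and publishes per day once bundled at the edge.
readings_per_day = TRAILERS * (86_400 // REPORT_INTERVAL_S)
bundles_per_day = readings_per_day // READINGS_PER_BUNDLE

# Monthly line items in USD, as quoted in the example.
MONTHLY_COST_USD = {
    "IoT Core": 600,
    "Kinesis": 700,
    "Flink": 900,
    "Timestream": 1_400,
    "Firehose+S3+Glacier": 500,
    "QuickSight": 1_200,
}
total = sum(MONTHLY_COST_USD.values())
per_trailer = total / TRAILERS

# Avoided loss: 31 spoiled loads at ~$45,000 each.
avoided_loss = 31 * 45_000
```

The results line up with the text: about 23.04 million readings become about 2.3 million publishes a day, the line items sum to $5,300, the per-trailer cost is roughly $0.66, and the avoided loss is about $1.4M.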
When to use it
Use this architecture when you have genuine device telemetry at scale, need both real-time alerting and long-horizon historical analytics, require per-device security and revocability, and want business users (not just engineers) to self-serve insights. It shines for connected products, industrial/OT modernization, fleet and cold-chain monitoring, and energy/utility telemetry.
Trade-offs and decisions:
- Timestream vs. SiteWise. If your audience is OT/plant engineers who think in equipment hierarchies and need OEE, lead with SiteWise. If it’s data scientists and BI analysts who want SQL and ML, lead with Timestream + S3. Running both is common but doubles modeling effort — do it deliberately.
- Kinesis vs. MSK vs. direct-to-Firehose. For simple “land it in the lake,” IoT Rule → Firehose directly is cheaper and needs no stream processor. Add Kinesis when you need replay, multiple independent consumers, or sub-second processing. Choose MSK (Kafka) only if you have existing Kafka ecosystems or need its specific semantics — otherwise Kinesis is lower-ops.
- Timestream vs. OpenSearch. OpenSearch is tempting for time-series + search, but at high cardinality and long retention it gets expensive and operationally heavy versus serverless Timestream. Reach for it when you genuinely need full-text/log search alongside metrics.
Anti-patterns to avoid:
- No edge aggregation — firehosing every raw 1-second reading to the cloud is the single biggest cost mistake. Aggregate or batch at the edge.
- Shared device certificates: one stolen device compromises the fleet, and the certificate cannot be revoked without disconnecting every legitimate device. Always use per-device identity.
- Treating Timestream/SiteWise as the system of record — they’re serving layers. The S3 lake is the durable truth; everything else is rebuildable from it.
- Skipping IoT rule error actions — silent rule failures lose data invisibly. Always attach a DLQ.
- Querying raw events from dashboards — use scheduled-query rollups and SPICE so QuickSight hits small aggregates, not billions of rows.
- Building this for low-volume or non-streaming data — if you have a few hundred readings a day or batch file uploads, this is over-engineered. A scheduled Glue job into S3 + Athena + QuickSight is the right-sized alternative.
The pattern’s strength is that each layer is independently serverless or elastic, the data path degrades gracefully (devices buffer, Kinesis replays, the lake never forgets), and cost scales sub-linearly — so the same blueprint serves a 40-unit pilot and an 8-million-device global fleet with only configuration changes.