AWS Enterprise Architecture: IoT Analytics

Industrial and connected-product companies sit on a paradox: their machines emit more telemetry than ever, yet the data lands in a dozen disconnected silos — a SCADA historian here, a CSV export there, a vendor cloud nobody can query. The result is that the most operationally valuable signal in the business (what the physical assets are actually doing, second by second) is the hardest to analyze. This article lays out a reusable AWS architecture that turns raw device telemetry into governed, queryable, dashboard-ready analytics, scaling from a single pilot line to a global fleet of millions of devices.

The business scenario

Picture three companies that look different but share the same shape of problem.

A mid-size HVAC manufacturer ships 40,000 rooftop units a year, each with a cellular modem. Warranty claims are killing margins, and the only way to know a compressor is failing is when a customer calls. They want predictive maintenance, but the telemetry never leaves the unit’s flash memory.

A regional water utility runs 1,200 pumping stations with PLCs speaking Modbus and OPC-UA. Their historian retains 90 days at full resolution, then downsamples and effectively forgets. Regulators now demand multi-year auditable records of flow, pressure, and chlorine dosing, and the operations team wants a single pane of glass instead of 1,200 vendor HMIs.

A logistics operator has 8,000 refrigerated trailers (reefers) with GPS, temperature, door, and fuel sensors reporting every 30 seconds. Cold-chain compliance fines are real money, and a single spoiled load can cost more than a year of cloud bills.

Every one of these is the same architecture under the hood: high-cardinality device telemetry, ingested securely at the edge, processed in motion, stored as time-series, and surfaced to both engineers and business users. What differs is scale (thousands vs. millions of devices), protocol (MQTT vs. industrial fieldbus), and latency tolerance (a reefer alarm needs seconds; a warranty trend can wait minutes).

The non-negotiable requirements that recur across all three:

This is the canonical AWS IoT analytics stack: IoT Core → Kinesis → Timestream / SiteWise → QuickSight, with S3 as the durable backbone.

Architecture overview

The end-to-end data path moves left to right, from physical asset to business insight, with a clean split between a low-latency “hot path” and a high-throughput “warm/cold path.”

Figure: AWS IoT analytics reference architecture. Devices and IoT Greengrass ingest to IoT Core; the Rules Engine fans out to a Kinesis hot path (Managed Flink, DynamoDB, SNS/IoT Events), a Timestream and IoT SiteWise time-series tier, and a Firehose-to-S3 cold-path data lake (Glue, Athena, SageMaker), surfaced via QuickSight and SiteWise Monitor.

1. Edge & connectivity. Devices connect to AWS IoT Core over MQTT secured with TLS (1.2 or 1.3) and mutual authentication using X.509 certificates; where a network only allows outbound HTTPS on port 443, MQTT over WebSockets is the fallback. Industrial sites that speak Modbus/OPC-UA don’t talk MQTT natively, so a gateway running AWS IoT Greengrass sits on-prem, normalizes fieldbus protocols, buffers during WAN outages, and forwards to the cloud. SiteWise can also ingest directly from a SiteWise Edge gateway at the plant.

2. Ingestion & routing. IoT Core’s message broker receives every publish. The IoT Rules Engine is the fan-out hub: a single inbound message can be routed by SQL-like rules to multiple destinations simultaneously. One rule pushes the full firehose to Amazon Kinesis Data Streams for stream processing; another sends a filtered subset (e.g., only alarm=true messages) to AWS IoT Events or directly to an SNS topic; a third writes structured measurements into AWS IoT SiteWise asset properties.

3. Hot path (seconds). Kinesis Data Streams feeds a stream processor: either Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) for windowed aggregations and anomaly detection, or a Lambda consumer for simpler enrichment. Anomalies and threshold breaches fan out to SNS / IoT Events for alerting and to DynamoDB for the “current device state” lookup that operators query. Sensor-to-alert latency is typically single-digit seconds.

4. Warm path / time-series store. Cleaned, enriched records land in Amazon Timestream — a purpose-built, serverless time-series database with a memory tier (fast, recent) and a magnetic tier (cheap, historical), and automatic tiering between them. In parallel, IoT SiteWise models the data against an asset hierarchy (Site → Line → Machine → Sensor), computes derived metrics (“transforms” and “metrics” like rolling OEE or average pressure), and serves both engineers and dashboards.

5. Cold path / data lake. Kinesis Data Firehose continuously batches raw and processed telemetry into Amazon S3 in Parquet, partitioned by date and device group. This is the immutable system of record. AWS Glue catalogs it; Athena queries it ad hoc; the lake feeds ML training (SageMaker) and long-term compliance retention via S3 lifecycle policies into Glacier.

6. Visualization & consumption. Amazon QuickSight is the business-facing layer. It connects natively to Timestream and Athena, uses SPICE (its in-memory engine) for snappy dashboards, and supports embedded analytics so the utility’s operators or the manufacturer’s dealers see dashboards inside their own portals. SiteWise also offers SiteWise Monitor portals for plant engineers who think in assets, not SQL.
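The fan-out behaviour described in step 2 can be illustrated with a small, self-contained sketch. It is not the Rules Engine itself: the `RULES` table, the destination names, and the `alarm` / `measurements` payload fields are hypothetical stand-ins chosen for illustration. The point it shows is that every matching rule fires, rather than only the first match.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]
Rule = Tuple[str, Callable[[Message], bool]]

# Hypothetical rule table: (destination name, predicate over the message).
RULES: List[Rule] = [
    ("kinesis_stream", lambda m: True),                # full stream for processing
    ("alerts", lambda m: m.get("alarm") is True),      # alarm-only subset
    ("sitewise", lambda m: "measurements" in m),       # structured measurements
]


def route(message: Message) -> List[str]:
    """Return every destination whose predicate matches (fan-out, not first-match)."""
    return [destination for destination, matches in RULES if matches(message)]
```

A message with `alarm` set to true and a `measurements` field reaches all three destinations, while a routine reading without either reaches only the stream.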

In one sentence: IoT Core authenticates and ingests, the Rules Engine fans data into Kinesis (hot) and SiteWise/Firehose (warm/cold), Timestream and SiteWise store and model it, and QuickSight + SiteWise Monitor present it — with S3 as the durable lake underneath everything.
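As a rough illustration of the date and device-group partitioning used for the S3 lake, the sketch below builds a Hive-style prefix. The `telemetry/` root and the `device_group=` / `dt=` key names are assumptions, not something the architecture mandates; in practice Firehose dynamic partitioning would produce the prefix and the object names.

```python
from datetime import datetime, timezone


def lake_prefix(device_group: str, epoch_seconds: int) -> str:
    """Build a Hive-style S3 prefix partitioned by device group and UTC date."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"telemetry/device_group={device_group}/dt={day}/"


# A reading taken at 2023-11-14 22:13:20 UTC lands in that day's partition.
example = lake_prefix("reefers-west", 1_700_000_000)
```

Keeping the partition columns in the key lets Athena prune by date and device group instead of scanning the whole lake.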

Component breakdown

| Component | Role | Why it’s here | Key configuration choices |
|---|---|---|---|
| AWS IoT Core | Managed MQTT broker + device registry + Rules Engine | Secure, scalable front door for millions of devices with per-device identity | X.509 cert per device via fleet provisioning; Thing Groups for fleet management; basic ingest topics to skip broker billing on high-volume rule traffic |
| AWS IoT Greengrass | Edge runtime on on-prem gateways | Protocol translation (Modbus/OPC-UA→MQTT), local buffering, edge ML inference during WAN loss | Stream Manager for store-and-forward; component-based deployment; local Lambda/container compute |
| IoT Rules Engine | SQL routing of inbound messages | Single message → many destinations without custom code | SQL SELECT with functions; error action to SQS/CloudWatch; route to Kinesis, SiteWise, Firehose, Lambda in parallel |
| Kinesis Data Streams | Ordered, replayable telemetry pipe | Decouples ingestion from processing; absorbs spikes; multiple consumers | On-demand mode for unpredictable IoT load, or provisioned shards (1 MB/s or 1,000 records/s of writes each); partition key = deviceId; 24h–365d retention |
| Managed Service for Apache Flink | Stateful stream processing | Windowed aggregations, sessionization, real-time anomaly detection (e.g., Random Cut Forest) | Tumbling/sliding windows; checkpointing to S3; autoscaling parallelism |
| Amazon Timestream | Serverless time-series database | Purpose-built for trillions of events/day with automatic hot→cold tiering | Memory store retention (hours/days) + magnetic store (years); scheduled queries for rollups; partition by dimension |
| AWS IoT SiteWise | Industrial asset modeling + time-series | Gives raw signals asset context and computes OEE/derived metrics; engineer-friendly | Asset models + hierarchies; transforms & metrics; SiteWise Edge for on-prem; data buffering |
| Kinesis Data Firehose | Batched delivery to the lake | Zero-code, serverless landing of data into S3/Redshift | Parquet conversion via Glue schema; dynamic partitioning by deviceGroup/date; buffering 64–128 MB |
| Amazon S3 | Durable data lake / system of record | Immutable raw store, ML source, compliance archive | Partitioned Parquet; lifecycle to Glacier; Object Lock for WORM compliance |
| Amazon DynamoDB | “Current state” / device shadow store | Single-digit-ms lookup of latest reading per device for operator screens | deviceId PK; TTL on stale entries; on-demand capacity |
| Amazon QuickSight | BI & embedded dashboards | Self-service analytics for business users; embeddable in portals | SPICE for speed, direct query for freshness; row-level security per tenant; embedding via OIDC |
| SiteWise Monitor | Asset-centric operational portals | Plant engineers visualize hierarchies without BI tooling | Portals, projects, dashboards scoped by asset; SSO via IAM Identity Center |
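The “current state” store in the table boils down to a latest-reading-wins upsert per device. The sketch below shows that logic against a plain dictionary so it runs anywhere. The `deviceId` and `ts` field names are assumptions, and in DynamoDB the same guard would more likely be expressed as a conditional write than as client-side code.

```python
from typing import Dict


def apply_reading(state: Dict[str, dict], reading: dict) -> bool:
    """Upsert a reading keyed by deviceId, ignoring stale or duplicate arrivals.

    Returns True when the stored state was updated.
    """
    device_id = reading["deviceId"]
    current = state.get(device_id)
    if current is not None and current["ts"] >= reading["ts"]:
        return False  # an equal or newer reading is already stored
    state[device_id] = reading
    return True
```

The timestamp guard matters because stream consumers can see replays and out-of-order records; without it an old reading could overwrite a newer one on the operator screen.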

A note on why two time-series destinations exist. Timestream is a general-purpose time-series database you query with SQL — ideal for cross-device analytics, ML feature engineering, and QuickSight. SiteWise is opinionated around the industrial asset model: it natively understands a hierarchy of equipment and computes engineering metrics. Many enterprises run both — SiteWise for the OT/engineering audience and the asset graph, Timestream (and the S3 lake) for the data-science and BI audience. Smaller deployments pick one. We call this out explicitly so the choice is deliberate, not accidental.

Implementation guidance

Device identity and provisioning. Never ship devices with a shared certificate. Use IoT Core fleet provisioning: each device presents a bootstrap claim certificate, calls the provisioning template, and receives a unique X.509 certificate bound to a Thing. Attach an IoT policy scoped with policy variables so a device can only publish/subscribe to its own topic namespace:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["iot:Publish"],
    "Resource": "arn:aws:iot:*:*:topic/dt/*/*/${iot:Connection.Thing.ThingName}/telemetry"
  }]
}

This is the cornerstone of Zero Trust here — a compromised device cannot impersonate or eavesdrop on any other.
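To make the effect of the policy variable concrete, here is a simplified, offline model of the check. It is a sketch only: it resolves the thing-name variable and treats `*` as "any characters", which is an approximation of IoT policy matching. The wildcard pattern in the usage note is an assumption that fits a hierarchical topic scheme.

```python
import re

THING_NAME_VAR = "${iot:Connection.Thing.ThingName}"


def publish_allowed(topic_pattern: str, thing_name: str, topic: str) -> bool:
    """Approximate an iot:Publish resource check for one connected thing."""
    resolved = topic_pattern.replace(THING_NAME_VAR, thing_name)
    # Escape literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in resolved.split("*"))
    return re.fullmatch(regex, topic) is not None
```

With the pattern `dt/*/*/${iot:Connection.Thing.ThingName}/telemetry`, a device connected as `trl-0042` may publish to `dt/west/reefer/trl-0042/telemetry` but not to the same path under `trl-0043`.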

Topic design. Adopt a hierarchical topic taxonomy, e.g. dt/<site>/<assetType>/<deviceId>/telemetry for data and cmd/<deviceId>/# for commands. Use basic ingest ($aws/rules/<ruleName>/...) for high-volume telemetry so you pay Rules Engine and downstream costs but skip per-message broker messaging charges.
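A small helper pair can keep the taxonomy consistent on both the device and the consumer side. This is a sketch under the topic layout described above; the function names are hypothetical. `topic_segment` mirrors the 1-based indexing of the Rules Engine `topic(n)` function, so the device ID in this layout is segment 4.

```python
def build_topic(site: str, asset_type: str, device_id: str) -> str:
    """Compose a telemetry topic as dt/<site>/<assetType>/<deviceId>/telemetry."""
    for part in (site, asset_type, device_id):
        if not part or any(ch in part for ch in "/+#"):
            raise ValueError(f"invalid topic segment: {part!r}")
    return f"dt/{site}/{asset_type}/{device_id}/telemetry"


def topic_segment(topic: str, n: int) -> str:
    """Return the n-th topic segment, counting from 1."""
    return topic.split("/")[n - 1]
```

Rejecting `/`, `+`, and `#` inside a segment prevents a malformed device ID from shifting the hierarchy or colliding with MQTT wildcards.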

IaC with Terraform. Manage everything as code. A representative module layout:

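# IoT Core ingestion rule -> Kinesis (hot) + Firehose (cold).
# The SiteWise rule action is configured separately.
resource "aws_iot_topic_rule" "telemetry_fanout" {
  name        = "telemetry_fanout"
  enabled     = true
  # Topic layout is dt/<site>/<assetType>/<deviceId>/telemetry, so the
  # device ID is the 4th segment (topic() is 1-indexed).
  sql         = "SELECT *, topic(4) AS deviceId FROM 'dt/+/+/+/telemetry'"
  sql_version = "2016-03-23"

  kinesis {
    role_arn      = aws_iam_role.iot_to_kinesis.arn
    stream_name   = aws_kinesis_stream.telemetry.name
    # "$$" escapes Terraform interpolation, so IoT Core receives ${topic(4)}.
    # Deriving the key from the topic avoids depending on a payload field.
    partition_key = "$${topic(4)}"
  }

  firehose {
    role_arn             = aws_iam_role.iot_to_firehose.arn
    delivery_stream_name = aws_kinesis_firehose_delivery_stream.lake.name
    separator            = "\n"
  }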

  error_action {
    sqs {
      role_arn   = aws_iam_role.iot_dlq.arn
      queue_url  = aws_sqs_queue.rule_dlq.id
      use_base64 = false
    }
  }
}

resource "aws_kinesis_stream" "telemetry" {
  name             = "telemetry"
  stream_mode_details { stream_mode = "ON_DEMAND" }
  retention_period = 24
}

resource "aws_timestreamwrite_database" "iot" { database_name = "iot_analytics" }
resource "aws_timestreamwrite_table" "telemetry" {
  database_name = aws_timestreamwrite_database.iot.database_name
  table_name    = "device_telemetry"
  retention_properties {
    memory_store_retention_period_in_hours  = 24
    magnetic_store_retention_period_in_days = 1825   # 5 years
  }
}

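Note the error action, which every rule should have: failed rule actions are easy to miss unless you route them to a dead-letter queue or enable IoT logging to CloudWatch. For SiteWise, model asset hierarchies as code with asset model and asset resources (aws_iotsitewise_asset_model and aws_iotsitewise_asset where your provider version offers them; the Cloud Control provider exposes equivalents as awscc_iotsitewise_asset_model and awscc_iotsitewise_asset), then wire the SiteWise rule action to map MQTT payload fields to asset property IDs. Bicep and Deployment Manager don’t apply here because this stack is AWS-native; the Terraform AWS provider is the standard choice, with the CDK as an alternative for teams that prefer TypeScript/Python.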

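Stream processing. For the Flink job, checkpoint to S3, key streams by deviceId, and use tumbling windows for periodic rollups (1-min averages) plus a Random Cut Forest model for unsupervised anomaly scores. Note that RANDOM_CUT_FOREST is a built-in function of the older Kinesis Data Analytics SQL runtime; in a Flink application the algorithm comes from a library you package with the job. Emit anomalies to a second Kinesis stream or directly to IoT Events for the alarm state machine.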
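The 1-minute tumbling-window rollup is easier to reason about with a tiny offline model. The sketch below is plain Python, not Flink: it processes a finite list rather than an unbounded stream and ignores watermarks and late data. The `(device_id, epoch_seconds, value)` tuple shape is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60

Reading = Tuple[str, int, float]  # (device_id, epoch_seconds, value)


def tumbling_averages(readings: Iterable[Reading]) -> Dict[Tuple[str, int], float]:
    """Average values per device over fixed, non-overlapping 60-second windows.

    Returns {(device_id, window_start_epoch): mean_value}.
    """
    sums: Dict[Tuple[str, int], list] = defaultdict(lambda: [0.0, 0])
    for device_id, ts, value in readings:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to the window boundary
        acc = sums[(device_id, window_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Keying by `(device_id, window_start)` corresponds to keying the stream by deviceId and then windowing, which is why per-device ordering matters upstream.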

Networking and identity wiring. Keep processing private: run Flink/Lambda consumers in a VPC and reach AWS services over VPC interface endpoints (PrivateLink) for Kinesis, Timestream, and S3 (gateway endpoint) so analytics traffic never traverses the public internet. IoT Core’s data plane is a public TLS endpoint by design, but you can use IoT Core VPC endpoints for device traffic from within your network or via Direct Connect for fixed industrial sites. Use IAM Identity Center for human SSO into QuickSight and SiteWise Monitor; use IAM roles (never long-lived keys) for every service-to-service hop. QuickSight embedding uses the GenerateEmbedUrlForRegisteredUser API behind your app’s auth.

Enterprise considerations

Security & Zero Trust. Identity is per-device, per-user, and per-service — never shared. Device certs are revocable instantly via the registry and continuously audited by AWS IoT Device Defender, which detects anomalous behavior (a device suddenly publishing 100x its baseline, or to topics outside its policy) and can quarantine it. Encrypt everywhere: TLS in transit; KMS-managed keys at rest in Kinesis, Timestream, S3, and DynamoDB. Scope IoT policies with policy variables (above). For the data lake, Lake Formation enforces column- and row-level permissions so the BI team sees aggregates but not raw GPS traces if policy forbids. QuickSight row-level security ensures a dealer sees only their own units in an embedded dashboard.

Cost optimization. IoT cost scales with message count and size, so the biggest lever is edge aggregation — have Greengrass or the device batch and pre-aggregate rather than firehosing every raw reading. Use basic ingest to drop broker messaging charges on telemetry that only feeds rules. Pick Kinesis on-demand for spiky fleets but switch to provisioned shards once load is predictable (often 40–60% cheaper at steady state). Lean on Timestream’s tiering: keep memory-store retention short (hours/days) and let data fall to the cheap magnetic tier; use scheduled queries to pre-aggregate so dashboards hit small rollup tables, not raw events. Land the lake in Parquet (5–10x cheaper to scan in Athena than JSON) and lifecycle cold partitions to Glacier. In QuickSight, SPICE caching avoids re-querying Timestream on every dashboard view, and the per-reader pricing keeps embedded analytics economical at scale.
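Edge aggregation, the largest cost lever above, can be as simple as a device-side batcher. The following is a minimal sketch with a hypothetical `ReadingBatcher` class and payload shape; a real device would also flush on a timer and persist the buffer across reboots.

```python
import json
from typing import List, Optional


class ReadingBatcher:
    """Buffer readings and emit one JSON payload per full batch."""

    def __init__(self, batch_size: int = 10) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._pending: List[dict] = []

    def add(self, reading: dict) -> Optional[str]:
        """Add a reading; return a serialized bundle when the batch is full, else None."""
        self._pending.append(reading)
        if len(self._pending) < self.batch_size:
            return None
        payload = json.dumps({"readings": self._pending}, separators=(",", ":"))
        self._pending = []
        return payload
```

With a batch size of 10, ten sensor readings become one publish, so message-count charges fall by roughly a factor of ten as long as the bundle stays within a single metered message size.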

Scalability. Every tier is horizontally elastic: IoT Core scales to millions of connections; Kinesis to thousands of shards (or on-demand auto-scale); Timestream and Firehose are serverless. Partition keys (deviceId) keep ordering per-device while spreading load. The architecture’s scaling is sub-linear in cost because of aggregation, tiering, and Parquet — doubling device count does not double the bill.
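Why a deviceId partition key both preserves per-device order and spreads load can be sketched as follows. Kinesis maps the MD5 hash of the partition key into a 128-bit hash-key space that is divided among shards; the sketch assumes the shards split that space evenly, which holds for a freshly created stream but not necessarily after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit hash-key space


def shard_index(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE
```

The same device always lands on the same shard, so its records stay ordered, while different device IDs scatter across shards.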

Reliability & DR (RTO/RPO). Kinesis replication is multi-AZ by default; set stream retention to 24h–7d so a downstream outage is replayable (this gives a near-zero RPO for the hot path — you can reprocess). The S3 lake is the durable source of truth (eleven nines); enable Cross-Region Replication for the buckets and Glacier archives if you need regional DR. Timestream is multi-AZ within a region; for cross-region resilience, the S3 lake is your rebuild source. A practical target: RPO ≈ minutes (bounded by Firehose buffer + Kinesis retention) and RTO of a few hours to stand up the processing/serving tier in a second region from IaC, with the lake replicated. IoT Core devices should be configured with exponential-backoff reconnection and offline buffering (Greengrass Stream Manager / device-side queue) so a regional blip doesn’t lose data.
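The exponential-backoff reconnection mentioned above is commonly implemented with jitter so that a fleet leaving a dead zone does not reconnect in lockstep. This is a minimal sketch of the "full jitter" variant; the base and cap values are illustrative assumptions, not recommended settings.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 120.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a delay in seconds, uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return (rng or random).uniform(0.0, ceiling)
```

A device would call this with an attempt counter that resets after a successful connect, sleeping for the returned delay before each retry.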

Observability. Watch CloudWatch metrics on IoT Core (Connect.Success, RulesExecuted, RuleMessageThrottled), Kinesis (GetRecords.IteratorAgeMilliseconds, the canary for consumer lag), Firehose delivery success, and Timestream write/query latency. Alarm on iterator age rising (processing falling behind) and on the depth of the IoT rule error-action DLQ. Device Defender feeds a security dashboard. Trace the pipeline end-to-end and surface the operational view in CloudWatch dashboards, while business KPIs live in QuickSight.
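The consumer-lag alarm is typically evaluated over several consecutive datapoints so a single slow batch does not page anyone. The sketch below models that evaluation locally. The 60-second threshold and three-datapoint window are illustrative assumptions, and in production the rule would live in a CloudWatch alarm rather than application code.

```python
from typing import Sequence


def lag_alarm(
    iterator_age_ms: Sequence[float],
    threshold_ms: float = 60_000,
    datapoints: int = 3,
) -> bool:
    """Return True when the most recent `datapoints` samples all exceed the threshold."""
    recent = list(iterator_age_ms)[-datapoints:]
    return len(recent) == datapoints and all(v > threshold_ms for v in recent)
```

A brief spike followed by recovery does not trip the alarm, but a sustained backlog does.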

Governance. Tag every resource by cost center and environment. Use AWS Organizations + SCPs to enforce encryption and region restrictions. Catalog the lake in Glue Data Catalog and govern access with Lake Formation. Retain raw telemetry per regulatory mandate using S3 Object Lock (WORM) for tamper-proof compliance records — critical for the utility and cold-chain cases.

Reference enterprise example

ColdHaul Logistics operates 8,000 refrigerated trailers across North America. Each reefer reports GPS, setpoint vs. actual temperature, ambient temperature, door open/close events, and fuel level every 30 seconds — roughly 23 million messages per day, spiking to a reconnection storm of 8,000 simultaneous connects whenever trailers exit a dead zone. A single spoiled produce load costs the customer ~$45,000 and triggers a chargeback; regulators (FSMA, FDA cold-chain) require 2 years of auditable temperature records.

Their build:

Numbers and outcome. At ~23M readings/day, reduced to roughly 2.3M messages/day by edge batching (devices send a 10-reading bundle every 5 minutes rather than each reading raw), monthly costs land near IoT Core ingest ~$600, Kinesis on-demand ~$700, Flink ~$900, Timestream ~$1,400, Firehose+S3+Glacier ~$500, QuickSight ~$1,200, totaling roughly $5,300/month, or under $0.70 per trailer per month to monitor a $150,000 asset hauling perishable freight. In the first year, early compressor-failure detection prevented an estimated 31 spoiled loads (~$1.4M avoided), cold-chain chargebacks fell 78%, and the compliance team replaced a quarterly fire drill of CSV exports with a one-click audit export from the WORM archive. The data scientists later trained a remaining-useful-life model on the S3 lake with zero new pipeline work, because the lake was already there.
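The headline figures can be checked with a few lines of arithmetic. The sketch uses only numbers quoted in the example and reads the ~23M figure as individual readings, with batching dividing the message count by the bundle size; the cost figures themselves are the article's estimates, not computed prices.

```python
TRAILERS = 8_000
REPORT_INTERVAL_S = 30
READINGS_PER_BUNDLE = 10
SECONDS_PER_DAY = 86_400

readings_per_day = TRAILERS * (SECONDS_PER_DAY // REPORT_INTERVAL_S)   # 23,040,000
messages_per_day = readings_per_day // READINGS_PER_BUNDLE             # 2,304,000

# Monthly estimates in USD, as quoted in the example.
monthly_costs = {
    "iot_core": 600,
    "kinesis": 700,
    "flink": 900,
    "timestream": 1_400,
    "firehose_s3_glacier": 500,
    "quicksight": 1_200,
}
total_monthly = sum(monthly_costs.values())        # 5,300
cost_per_trailer = total_monthly / TRAILERS        # about 0.66

avoided_loss = 31 * 45_000                         # 1,395,000, i.e. ~$1.4M
```

The totals line up: about $0.66 per trailer per month against roughly $1.4M in avoided spoilage in the first year.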

When to use it

Use this architecture when you have genuine device telemetry at scale, need both real-time alerting and long-horizon historical analytics, require per-device security and revocability, and want business users (not just engineers) to self-serve insights. It shines for connected products, industrial/OT modernization, fleet and cold-chain monitoring, and energy/utility telemetry.

Trade-offs and decisions:

Anti-patterns to avoid:

The pattern’s strength is that each layer is independently serverless or elastic, the data path degrades gracefully (devices buffer, Kinesis replays, the lake never forgets), and cost scales sub-linearly — so the same blueprint serves a 40-unit pilot and an 8-million-device global fleet with only configuration changes.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
