Smart Building IoT and Energy Optimization on AWS

A commercial real-estate operator — call it the kind of REIT that manages 140 office and lab buildings across a dozen metros — gets handed two numbers by its board that cannot both be true with the status quo. The first is a 30% portfolio-wide energy reduction by 2030 to meet the disclosure commitments its tenants and lenders now demand. The second is the line item showing that energy is the single largest controllable operating cost in the portfolio, and that HVAC alone is roughly half of it. Today each building runs its own Building Management System (BMS) from a different vendor, with a facilities engineer who “knows the building” tuning chiller setpoints by feel, and an energy report that arrives 45 days after the month it describes. There is no way to know — across 140 buildings, in something close to real time — which air handler is fighting itself, which floor is being cooled and heated simultaneously, or what a one-degree setpoint change would actually save. This article is the reference architecture for fixing that on AWS: a fleet-wide IoT telemetry, analytics, and closed-loop optimization platform that a Chief Engineering Officer and a CISO will both sign.

The pressures are specific to physical infrastructure and worth naming, because they shape every choice downstream. Heterogeneity: BMS talk BACnet/IP, Modbus, and a long tail of proprietary protocols; you do not get to pick the field devices. Edge reality: a building’s network is flaky, the link to the cloud drops, and a chiller cannot wait for a round trip to decide whether to trip. Safety: this is operational technology — a bad setpoint pushed to a real air handler at 2pm in July is a comfort complaint at best and a frozen coil or a tenant evacuation at worst. Scale: 140 buildings times thousands of points each is a high-cardinality, high-frequency time-series problem that a relational database will not survive. And cost: the platform exists to save money, so its own bill has to stay a rounding error against the energy it cuts.

Why not the obvious shortcuts

Three shortcuts get proposed on every project like this, and each fails in a way worth saying out loud.

“Just use the BMS vendor’s cloud.” Each vendor offers a portal, but they are per-building silos that do not federate, cannot be queried across the portfolio, and lock the data behind an export you do not control — which defeats the entire point of a fleet view and a board-level number.

“Pull it all into our data warehouse.” A warehouse is built for batch analytics over rows, not for millions of out-of-order sensor samples a minute with per-point retention rules. You will spend more engineering effort fighting ingestion and cardinality than you ever spend on insight, and the freshness still will not be real-time.

“Let the ML model drive the building directly.” Tempting, and exactly how you cause an incident. An optimizer that writes setpoints straight to physical equipment with no edge safety envelope, no human gate for large moves, and no rollback is an OT safety failure waiting for a heat wave. The control loop has to be bounded at the edge and governed in the cloud.

The architecture below threads these: a standard ingestion path that normalizes every protocol, a purpose-built time-series store, an optimization loop that proposes setpoints rather than blindly imposing them, and an edge tier that keeps the building safe even when the cloud is unreachable.

Architecture overview

[Figure: Smart Building IoT and Energy Optimization on AWS, architecture diagram]

The platform runs three loops that share infrastructure but live on different clocks: a high-frequency ingestion loop moving telemetry from equipment to cloud, a periodic optimization loop computing better setpoints, and a slower action loop that turns anomalies and inefficiencies into work the facilities team actually does. Holding those three apart is the first step to operating this well.

The defining property of the topology is where the trust boundary sits. The buildings live in OT networks that are never directly reachable from the internet; the only path out is an outbound-only MQTT/TLS connection from a gateway to AWS IoT Core. Nothing in the cloud dials into a building. That single rule is what lets the security team sign an architecture that touches physical plant.

Ingestion loop, following the data:

  1. In each building sits an edge gateway running AWS IoT Greengrass on a small industrial PC. A protocol-adapter component speaks BACnet/IP and Modbus to the BMS, polling or subscribing to points — supply/return temperatures, valve positions, fan speeds, kWh meters, CO₂, occupancy. Greengrass normalizes vendor tag names to a canonical model and buffers locally, so a dropped WAN link means a backfill, not a data hole.
  2. The gateway publishes over MQTT/TLS to AWS IoT Core, authenticated by a unique X.509 certificate per gateway. The IoT Core rule engine fans each message out: high-value structured asset data goes to AWS IoT SiteWise; the raw stream is also archived to S3 (the cheap, durable system of record) and, for live alerting, routed to a Lambda.
  3. IoT SiteWise is the industrial-asset layer. You model the portfolio as an asset hierarchy — Portfolio → Building → Floor → AHU/Chiller → Measurement — and define transforms and metrics (e.g. chiller kW per ton, simultaneous heating-and-cooling detection) that compute continuously inside SiteWise. This is what turns a swamp of tags into a queryable model an analyst understands.
  4. Time-series samples land in Amazon Timestream, the purpose-built store for this workload. Recent data lives in the in-memory tier for fast dashboards; older data tiers automatically to magnetic storage; retention is a table-level policy, so points are grouped into tables by data class rather than pruned by a cron job. Timestream’s cardinality and time-ordering model is built for exactly the “millions of points, out-of-order, with retention” shape that breaks a relational DB.
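The normalization in step 1 amounts to a lookup from vendor tag to a canonical asset and point. A minimal Python sketch of that idea follows; the vendor tag strings, `CANONICAL_TAGS`, `Sample`, and `normalize` are hypothetical names for illustration, not part of Greengrass or any BMS product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping for one building: vendor tag -> (asset id, canonical point).
CANONICAL_TAGS = {
    "AHU3.SAT": ("ahu-03", "supply_air_temp_c"),
    "AHU_3:SupplyTemp": ("ahu-03", "supply_air_temp_c"),
    "CH2.KW": ("chiller-02", "power_kw"),
}


@dataclass(frozen=True)
class Sample:
    building_id: str
    asset_id: str
    point: str
    value: float
    effective_time_ms: int  # when the value was measured, not when it was sent


def normalize(building_id: str, vendor_tag: str, value: float,
              effective_time_ms: int) -> Optional[Sample]:
    """Map a vendor tag onto the canonical model; return None if unmapped."""
    mapped = CANONICAL_TAGS.get(vendor_tag)
    if mapped is None:
        return None  # unmapped tags are dropped in this sketch
    asset_id, point = mapped
    return Sample(building_id, asset_id, point, float(value), effective_time_ms)
```

Two different vendor spellings of the same supply-air sensor resolve to one canonical point, which is what lets downstream queries ignore which BMS produced the value.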

Optimization loop, on a schedule (typically every 5–15 minutes): an AWS Lambda orchestrator (kicked by EventBridge) reads the recent window from Timestream plus weather and occupancy context, and computes recommended setpoints. The simple, always-on baseline is a rules engine — reset chilled-water temperature on outside-air conditions, widen deadbands in unoccupied zones, stage equipment to favor the most efficient unit. The advanced path calls a model trained in Amazon SageMaker (a load-forecasting and setpoint-optimization model per building archetype) and hosted behind a SageMaker endpoint. Either way the output is a proposed setpoint with a predicted saving and a confidence — never an unconditional command.
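The rules baseline can be sketched as a chilled-water reset curve that emits a proposal rather than a command. This is an illustrative sketch only: the temperature breakpoints, the 2%-per-degree saving factor, and the fixed confidence value are placeholder assumptions, and `Proposal`, `chw_reset_setpoint`, and `propose_chw_reset` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    asset_id: str
    point: str
    current: float
    proposed: float
    predicted_saving_kwh: float
    confidence: float  # 0..1


def chw_reset_setpoint(oat_c: float, oat_lo: float = 15.0, oat_hi: float = 30.0,
                       chw_max: float = 9.0, chw_min: float = 6.0) -> float:
    """Linear reset: warmer chilled water in mild weather, colder in hot weather."""
    if oat_c <= oat_lo:
        return chw_max
    if oat_c >= oat_hi:
        return chw_min
    frac = (oat_c - oat_lo) / (oat_hi - oat_lo)
    return chw_max - frac * (chw_max - chw_min)


def propose_chw_reset(asset_id: str, current_setpoint_c: float, oat_c: float,
                      chiller_kw: float, saving_per_deg: float = 0.02,
                      cycle_hours: float = 0.25) -> Proposal:
    """Build a proposal; saving_per_deg is an assumed placeholder, not a measured value."""
    target = chw_reset_setpoint(oat_c)
    delta = target - current_setpoint_c  # positive means warmer water
    saving = max(delta, 0.0) * saving_per_deg * chiller_kw * cycle_hours
    return Proposal(asset_id, "chw_supply_setpoint_c", current_setpoint_c,
                    round(target, 2), round(saving, 3), confidence=0.6)
```

The output carries the current value, the proposed value, a predicted saving, and a confidence, so whatever consumes it can decide whether to apply, gate, or discard the move.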

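Action loop, the one that produces real-world change: proposals that are small and inside the edge envelope apply automatically; larger moves wait for a ServiceNow change approval; and faults that software cannot fix, such as a stuck damper, become ServiceNow work orders for a technician to close.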

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge runtime | AWS IoT Greengrass | Protocol translation, local buffering, edge safety envelope, store-and-forward | BACnet/Modbus adapter component; stream manager; local min/max guardrails |
| Ingestion broker | AWS IoT Core | MQTT/TLS broker, per-device X.509 auth, rule-based fan-out | Thing-per-gateway certs; rules to SiteWise + S3 + Lambda |
| Asset model | AWS IoT SiteWise | Industrial asset hierarchy, continuous transforms/metrics | Portfolio→Building→Asset model; kW/ton + simultaneous heat/cool metrics |
| Time-series store | Amazon Timestream | High-cardinality telemetry, tiered retention, fast recent queries | Memory tier for dashboards; magnetic tier; retention set per table |
| Raw archive | Amazon S3 | Durable system of record, replay, ML training data | Lifecycle to Glacier; the rebuild source of truth |
| Optimization | Lambda + Amazon SageMaker | Rules baseline + ML setpoint/forecast model | EventBridge schedule; per-archetype model; output = proposal + confidence |
| Dashboards | Amazon Managed Grafana | Energy, comfort, fleet anomaly, sustainability reporting | Timestream + SiteWise data sources; per-building folders; SSO via Okta |
| ITSM / work orders | ServiceNow | Fault → incident → facilities work order, change approval for control | Auto-ticket on anomaly; change gate before closed-loop control enables |
| Identity / SSO | Okta + (Entra ID) | Workforce SSO to Grafana/console; partner BMS tenants via Entra | SAML 2.0 to AWS IAM Identity Center; group attributes drive Grafana org/role |
| Secrets | HashiCorp Vault | Third-party API tokens, weather/utility keys, BMS creds at the edge | Dynamic leases; gateway-side secrets component pulls short-lived creds |
| CSPM / posture | Wiz + Wiz Code | Cloud posture, exposed-data and attack-path analysis; IaC scanning | Agentless scan of S3/IoT/IAM; Wiz Code blocks risky Terraform in PR |
| Runtime security | CrowdStrike Falcon | Runtime protection on Greengrass edge hosts and any EC2/EKS | Sensor on gateway OS; detections to the SOC; OT-aware policy |
| Observability | Datadog (Dynatrace) | Platform telemetry, pipeline traces, device-fleet health, cost | IoT/Lambda metrics; ingestion-lag monitors; anomaly + forecast monitors |
| CI / IaC | GitHub Actions + Terraform; Argo CD; Ansible | Pipeline + infra-as-code; GitOps for edge config; fleet host config | OIDC to AWS (no stored creds); Argo CD syncs Greengrass deployments; Ansible bootstraps gateways |

A few of these choices deserve the why, because they are the ones teams get wrong.

Why SiteWise sits between IoT Core and the analyst. You could dump raw MQTT straight into Timestream and compute everything in queries, but then every dashboard re-derives “kW per ton” and every analyst re-learns each vendor’s tag soup. SiteWise gives you one asset model with named, continuously computed metrics, so a query asks for “Building 12 / Chiller 2 / efficiency” instead of joining four cryptic point IDs. It also makes the simultaneous heating-and-cooling check — the single most common, most embarrassing waste in commercial HVAC — a first-class metric rather than a tribal-knowledge spreadsheet.
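The two metrics named above reduce to short arithmetic. The sketch below only illustrates that arithmetic in Python; it is not SiteWise formula syntax. It assumes water-side measurements in US units (GPM and °F), where load in tons is approximately GPM × ΔT / 24, and the 10% valve threshold and `min_tons` cutoff are placeholder assumptions.

```python
from typing import Optional


def cooling_tons(flow_gpm: float, chw_return_f: float, chw_supply_f: float) -> float:
    """Water-side cooling load in refrigeration tons: GPM x deltaT(F) / 24."""
    return flow_gpm * (chw_return_f - chw_supply_f) / 24.0


def kw_per_ton(chiller_kw: float, flow_gpm: float, chw_return_f: float,
               chw_supply_f: float, min_tons: float = 1.0) -> Optional[float]:
    """Chiller efficiency; None when the load is too small for a meaningful ratio."""
    tons = cooling_tons(flow_gpm, chw_return_f, chw_supply_f)
    if tons < min_tons:
        return None
    return chiller_kw / tons


def simultaneous_heat_cool(heating_valve_pct: float, cooling_valve_pct: float,
                           threshold_pct: float = 10.0) -> bool:
    """True when both valves are open past the threshold at the same time."""
    return heating_valve_pct > threshold_pct and cooling_valve_pct > threshold_pct
```

For example, 480 GPM across a 10 °F split is a 200-ton load, so a chiller drawing 120 kW at that load is running at 0.6 kW per ton.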

Why Timestream, not a relational or general TSDB. The workload is high cardinality (points × buildings), high frequency, frequently out of order (store-and-forward backfills), and governed by per-point retention. A relational DB chokes on the write volume and the index churn; Timestream’s separation of a fast memory tier from cheap magnetic storage, plus native time-ordering and retention policies, fits the shape directly and bills by ingestion and query rather than provisioned servers you babysit.
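To make the out-of-order point concrete, here is a hedged sketch of shaping one sample as a Timestream write record keyed on measurement time. The dict keys follow the Timestream `WriteRecords` record format; the function name and the choice of dimensions are assumptions for illustration.

```python
def to_timestream_record(building_id: str, asset_id: str, point: str,
                         value: float, effective_time_ms: int,
                         version: int = 1) -> dict:
    """Shape one sample as a Timestream record, timestamped at measurement time."""
    return {
        "Dimensions": [
            {"Name": "buildingId", "Value": building_id},
            {"Name": "assetId", "Value": asset_id},
        ],
        "MeasureName": point,
        "MeasureValue": str(value),       # Timestream takes values as strings
        "MeasureValueType": "DOUBLE",
        "Time": str(effective_time_ms),   # measurement time, so backfills sort correctly
        "TimeUnit": "MILLISECONDS",
        "Version": version,               # a higher version overwrites the same point in time
    }
```

A batch of such records would be passed as `Records` to the `write_records` call of the boto3 `timestream-write` client, with database and table names of your choosing. Backfilled samples older than the memory-store retention window are only accepted if magnetic-store writes are enabled on the table, which is worth checking before relying on long store-and-forward buffers.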

Why the optimizer proposes, the edge enforces. Splitting the control authority is the safety story. The cloud optimizer can be as clever as you like because it cannot, by construction, push equipment past the min/max envelope that Greengrass enforces locally. If the WAN drops or a model misbehaves, the building keeps running on the last good setpoints inside safe bounds. Large moves additionally pass a ServiceNow change gate so a human owns the decision to let automation touch plant.
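The split of authority can be sketched in a few lines. In the architecture the clamp runs at the edge and the approval happens in ServiceNow; the sketch below folds both checks into one Python function purely to show the logic, and `Envelope`, `apply_command`, and the status strings are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Envelope:
    lo: float             # hard minimum enforced locally
    hi: float             # hard maximum enforced locally
    max_auto_step: float  # largest move allowed without human approval


def apply_command(current: float, requested: float, env: Envelope,
                  approved: bool = False) -> Tuple[float, str]:
    """Return (setpoint to write, status) for a requested setpoint."""
    # The envelope always wins, whatever the cloud asked for.
    clamped = min(max(requested, env.lo), env.hi)
    # Large moves keep the last good setpoint until a human approves them.
    if abs(clamped - current) > env.max_auto_step and not approved:
        return current, "held_for_approval"
    status = "clamped" if clamped != requested else "applied"
    return clamped, status
```

With an envelope of 5.5 to 10.0 and a one-degree automatic step, a request for 3.0 from a current 7.0 is held; once approved it is still written as 5.5, never 3.0.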

Implementation guidance

Provision with Terraform, and treat the edge fleet and the identity model as first deliverables. The order matters: certificates, IoT policies, and the SiteWise asset model are the foundation everything else binds to.

  1. IoT Core foundation — a Thing type and per-gateway Thing, an X.509 certificate per gateway, and a least-privilege IoT policy scoped so a gateway can only publish to and subscribe under its own topic prefix (a stolen cert must not be able to read the fleet).
  2. SiteWise asset model — the Portfolio → Building → Floor → Asset → Measurement hierarchy and the transforms/metrics, defined in code so a new building is a data change, not a console click.
  3. Timestream — database and tables with memory-tier and magnetic-tier retention set per data class (raw telemetry shorter, billing-grade meter data longer).
  4. Greengrass deployment — the runtime, the protocol-adapter component, and the local-guardrail component, shipped as a versioned deployment.

A minimal Terraform shape for an IoT thing + tightly scoped policy communicates the intent — a gateway is confined to its own topic space:

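variable "account" {
  type = string
}

resource "aws_iot_thing" "gateway" {
  name = "gw-bldg-012-cin"
}

data "aws_iam_policy_document" "gw" {
  # Connect only as this gateway's own client ID.
  statement {
    actions   = ["iot:Connect"]
    resources = ["arn:aws:iot:ap-south-1:${var.account}:client/gw-bldg-012-cin"]
  }

  # Publish and receive only on this building's topics.
  statement {
    actions   = ["iot:Publish", "iot:Receive"]
    resources = ["arn:aws:iot:ap-south-1:${var.account}:topic/bldg/012/*"]
  }

  # Subscribe is authorized against topicfilter resources, not topic.
  statement {
    actions   = ["iot:Subscribe"]
    resources = ["arn:aws:iot:ap-south-1:${var.account}:topicfilter/bldg/012/*"]
  }
}

resource "aws_iot_policy" "gw" {
  name   = "gw-bldg-012"
  policy = data.aws_iam_policy_document.gw.json
}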

The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak, a mistake the platform team intends never to repeat. Wiz Code runs in the same pull request, failing the build if the Terraform would, say, leave an S3 telemetry bucket public or over-broaden an IoT policy. Edge software rolls out by GitOps: Greengrass component versions are declared in Git and Argo CD reconciles the desired deployment to the fleet, while Ansible handles the one-time host bootstrap (OS hardening, agent install, certificate provisioning) for each new gateway.

Identity: federate the humans, scope the machines. Operators, energy analysts, and facilities managers sign in to Amazon Managed Grafana and the AWS console through Okta federated to AWS IAM Identity Center over SAML 2.0 (with SCIM for user and group provisioning); group membership maps to Grafana orgs and roles so an analyst sees the whole portfolio while a single-building engineer sees one folder. Where a BMS integrator or a tenant gets scoped partner access, that federation runs through Microsoft Entra ID into the same identity plane. Machines do not use long-lived keys: Lambdas and SageMaker assume IAM roles; gateways authenticate with X.509 certs; and the residual secrets that are neither (utility-API and weather-feed tokens, and the BMS service credentials a gateway needs locally) live in HashiCorp Vault, leased dynamically, with a gateway-side secrets component pulling short-lived copies so a credential never sits in plaintext on a building PC. The stock Greengrass secret manager component syncs from AWS Secrets Manager, so a Vault-backed fetch is a custom component.

Edge wiring. Keep the gateway dumb-but-safe and the cloud smart. The Greengrass adapter publishes on change-of-value with a heartbeat (do not stream every point every second — it is wasteful and floods Timestream); the stream manager batches and survives WAN outages; and the local guardrail component holds the hard min/max envelope independent of any cloud command. Tag every message with buildingId, assetId, point, and effectiveTime so out-of-order backfills land correctly.
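The change-of-value-plus-heartbeat rule described above can be sketched as a small stateful filter. This is a minimal Python illustration, not Greengrass code; `CovFilter`, its parameters, and the point-key format are hypothetical.

```python
from typing import Dict, Tuple


class CovFilter:
    """Publish when a value moves by at least `deadband`, or when
    `heartbeat_s` seconds have passed since the last publish for that point."""

    def __init__(self, deadband: float, heartbeat_s: float) -> None:
        self.deadband = deadband
        self.heartbeat_s = heartbeat_s
        self._last: Dict[str, Tuple[float, float]] = {}  # key -> (value, time_s)

    def should_publish(self, key: str, value: float, now_s: float) -> bool:
        last = self._last.get(key)
        if (last is None
                or abs(value - last[0]) >= self.deadband
                or now_s - last[1] >= self.heartbeat_s):
            # Remember what was last published, not what was last sampled,
            # so slow drift still accumulates past the deadband.
            self._last[key] = (value, now_s)
            return True
        return False
```

The heartbeat matters as much as the deadband: without it, a sensor that stops changing is indistinguishable from a gateway that stopped reporting.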

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction at the boundary that matters: buildings are never reachable inbound; the only egress is an authenticated, per-device outbound MQTT/TLS session; and every cloud principal is least-privilege IAM. Layer on top: (a) per-gateway X.509 identity with rotation, so revoking one compromised building is one certificate, not a fleet outage; (b) Wiz running continuous CSPM and sensitive-data-exposure analysis across S3, IoT, and IAM, alerting the moment a telemetry bucket drifts public or an IoT policy widens, with Wiz Code catching the same class of mistake in IaC before merge; (c) CrowdStrike Falcon sensors on the Greengrass edge hosts and any EC2/EKS, feeding the SOC with runtime threat detection on the very machines that can write to physical plant (the OT attack surface most platforms ignore); (d) a security-relevant event (a revoked cert seen reconnecting, a guardrail-clamped command, an anomalous control write) auto-raises a ServiceNow incident so there is a ticket, not just a log line. The control path is doubly gated: edge envelopes bound what can be commanded, and ServiceNow change approval bounds whether a non-trivial command leaves the cloud at all.

Cost optimization. The platform must stay cheap relative to the energy it saves, and the levers are mostly about not moving or storing data you do not need.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Edge filtering | Publish on change-of-value + heartbeat, not every point every second | Slashes ingestion volume and Timestream cost at the source |
| Timestream tiering | Short memory-tier retention; auto-tier to magnetic; raw to S3/Glacier | Pay for fast access only on recent data |
| Right-size the model | Rules baseline always-on; SageMaker endpoint only where ML earns it | Avoids paying for inference that a reset curve already wins |
| Serverless control plane | Lambda + EventBridge over an always-on cluster | No idle compute between 5-minute optimization cycles |
| S3 lifecycle | Age raw archive to Glacier; keep only billing-grade data hot | Durable history at archival prices |

Pipe per-pipeline and per-building cost and ingestion metrics to Datadog, which the platform team uses for the FinOps dashboard — so a building that suddenly doubles its message rate (a misbehaving gateway) shows up as a cost anomaly before it shows up as a surprise on the bill.
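The "building that suddenly doubles its message rate" check is simple enough to sketch. In the platform this would be a Datadog monitor; the Python below only illustrates the comparison, and the function name, the median baseline, and the factor of 2 are assumptions.

```python
from statistics import median
from typing import Dict, List


def rate_anomalies(history: Dict[str, List[float]], current: Dict[str, float],
                   factor: float = 2.0, min_baseline: float = 1.0) -> Dict[str, float]:
    """Flag buildings whose current message count is at least `factor` times
    their recent median. Returns building id -> ratio over baseline."""
    flagged = {}
    for building_id, count in current.items():
        past = history.get(building_id, [])
        if not past:
            continue  # no baseline yet, for example a newly onboarded building
        baseline = max(median(past), min_baseline)
        if count >= factor * baseline:
            flagged[building_id] = count / baseline
    return flagged
```

A median baseline is used here so that one earlier spike does not raise the bar for detecting the next one.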

Scalability. Each tier scales independently and is the bottleneck at a different point. IoT Core and Lambda scale to the fleet with no servers to size. Onboarding building 141 is a Terraform change (cert + policy + asset) and a GitOps deployment, not a re-architecture. The real ceilings to watch are Timestream ingestion and query throughput under fleet growth — which is precisely why edge change-of-value filtering matters — and per-account service quotas on IoT and Timestream, which you raise ahead of a portfolio expansion, not during one.

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per tier and be honest that this is a physical-world system. S3 is the durable source of truth (versioned, lifecycle-managed, cross-region replicated), so even a total loss of the analytics tier is a rebuild, not a data loss. Timestream and SiteWise are the derived analytics layer; their DR plan is “replay from S3 into a paired region,” with the asset model re-applied from Terraform. The safety-critical insight is that building operation does not depend on the cloud at all — Greengrass keeps each building safe and running locally through any cloud or regional outage, which inverts the usual DR anxiety. A pragmatic target: RTO 30 minutes, RPO 5 minutes for the analytics and optimization service, with raw history fully recoverable from S3; and zero building-safety impact from a cloud outage by design.

Observability. Instrument the platform end to end in Datadog: ingestion lag (gateway → IoT Core → Timestream), per-gateway connectivity and last-seen, optimization-cycle latency, and the SageMaker endpoint’s health. Emit the metrics the business actually cares about beside the technical ones — kWh and cost avoided per building, comfort-complaint rate (the guardrail that energy savings are not just making people miserable), anomaly-to-work-order time, and % of recommended setpoints accepted. The energy story itself lives in Amazon Managed Grafana, reading SiteWise and Timestream: a portfolio sustainability board for the executives, a per-building efficiency view for engineers, and a fleet anomaly board the operations center watches. Datadog watches the platform; Grafana tells the energy story; both matter and they are not the same dashboard.
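Two of the platform signals above, ingestion lag and per-gateway last-seen, come down to timestamp arithmetic. A minimal Python sketch, with hypothetical function names and a placeholder ten-minute silence threshold:

```python
from typing import Dict, List


def ingestion_lag_s(effective_time_ms: int, arrival_time_ms: int) -> float:
    """Seconds between measurement and arrival; clock skew is floored at zero."""
    return max(arrival_time_ms - effective_time_ms, 0) / 1000.0


def stale_gateways(last_seen_ms: Dict[str, int], now_ms: int,
                   max_silence_s: float = 600.0) -> List[str]:
    """Gateways that have been silent for longer than max_silence_s, sorted by id."""
    return sorted(
        gateway for gateway, seen in last_seen_ms.items()
        if (now_ms - seen) / 1000.0 > max_silence_s
    )
```

Backfilled samples after a WAN outage will show a large lag by design, so a lag monitor should probably distinguish live traffic from store-and-forward replays rather than alert on both.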

Governance. Pin the SageMaker model versions and promote new ones through a backtest against historical S3 data before they are allowed to recommend on live plant — a model that would have over-cooled last August fails the gate. Keep the rules engine and asset model in version control, reviewable and instantly revertable. Every control write is logged with its proposing model version, the operator who approved it (if gated), and the resulting telemetry — an audit trail for both energy claims and any safety review. New buildings and any change that enables closed-loop control pass a ServiceNow change approval, giving operations a documented gate before software is allowed to move physical equipment.
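The promotion gate can be sketched as a replay over historical rows. This is a deliberately simplified Python illustration: it checks only that a candidate's recommended zone setpoints stay inside a comfort band during occupied hours, whereas a real backtest would also model building response. The row fields, the comfort band, and the 1% tolerance are assumptions.

```python
from typing import Callable, Iterable, Tuple


def backtest_gate(model: Callable[[dict], float], history: Iterable[dict],
                  comfort_lo_c: float = 21.0, comfort_hi_c: float = 25.0,
                  max_violation_rate: float = 0.01) -> Tuple[bool, float]:
    """Replay occupied historical rows through a candidate model.
    Returns (passes, violation rate)."""
    occupied = [row for row in history if row["occupied"]]
    if not occupied:
        return False, 0.0  # no evidence, so do not promote
    violations = sum(
        1 for row in occupied
        if not comfort_lo_c <= model(row) <= comfort_hi_c
    )
    rate = violations / len(occupied)
    return rate <= max_violation_rate, rate
```

A candidate that recommends 19 °C whenever the outside air is above 30 °C would fail this gate on any history that includes a hot occupied afternoon, which is the "would have over-cooled last August" case.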

Explicit tradeoffs

Accept these or do not build it. A fleet IoT platform has real moving parts the per-building portal does not: an edge runtime to deploy and patch across 140 sites, an asset model to keep in sync with physical reality, and a control path whose safety you must prove, not assume. Closed-loop control specifically is where ambition meets liability — the edge envelopes, the change gates, the audit log, and the human-approve step on large moves are overhead you can skip for a read-only “monitoring and recommendations” deployment and absolutely cannot skip the day you let the cloud write a setpoint. The Okta-to-Identity-Center federation (and the Entra path for partners) adds an identity hop a single-IdP shop will not need. And operating two observability surfaces — Datadog for the platform, Grafana for the energy — is deliberate duplication you accept because the audiences and questions are different.

The alternatives, and when they win. If you run one building, the BMS vendor’s own analytics may be enough and this platform is overkill. If you only need reporting, not control, stop at SiteWise + Timestream + Grafana and skip the entire action loop — most of the safety complexity lives there, and a monitoring-and-recommendation system already captures a large share of the savings by simply showing engineers what is wrong. If your estate is mostly Azure for compliance or commercial reasons, the same shape maps cleanly to IoT Hub + Azure Digital Twins + Data Explorer; the architectural ideas (edge safety envelope, asset model, time-series tier, propose-not-impose control, ITSM-gated actions) are cloud-agnostic and are the part that actually matters. Graduate to this full platform when portfolio scale, a board-level energy target, and closed-loop control are all on the table at once.

The shape of the win

For the REIT, the payoff is not “an IoT dashboard.” It is that an energy analyst sees, across 140 buildings on one screen, that 38 air handlers are heating and cooling the same air simultaneously; that the platform recommends the reset curves and deadbands to stop it; that the safe, small moves apply automatically within edge guardrails while the large ones wait for a one-click approval; and that the stuck dampers it cannot fix in software become ServiceNow work orders that a technician actually closes. The 30% number stops being a board aspiration and becomes a metered, audited, building-by-building line on a Grafana board the CFO trusts. Everything upstream — the per-device certificates, the Greengrass safety envelope, the SiteWise asset model, the Timestream tiering, the Vault-held edge secrets, the Wiz and CrowdStrike posture, the Datadog FinOps view — exists so that a Chief Engineering Officer, a CISO, and a sustainability officer each say yes. Start with monitoring if you must; this is where a portfolio-scale “optimize our buildings” has to land.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
