A Time-Series IoT Data Platform

A logistics company runs forty thousand refrigerated trailers, and every one of them is a regulatory obligation on wheels. A reefer carrying vaccines or fresh seafood must hold temperature inside a tolerance band for the entire journey, and the company has to prove it did — to the shipper, to the FDA or the food-safety auditor, to the insurer when a claim lands. So each trailer streams temperature, humidity, door state, GPS, fuel level, and compressor health every few seconds. Multiply forty thousand assets by a dozen channels by a reading every five seconds and you are writing on the order of a hundred thousand data points per second, continuously, forever. Lose a trailer’s data for the ninety minutes it sat in a dead cellular zone outside Laredo and you have an audit gap; let your dashboard go stale by ten minutes and a load spoils before anyone reacts.

This is the shape of an industrial time-series problem, and it does not fit a normal database. A relational store buckles under sustained six-figure inserts per second; an object store is cheap but cannot answer “show me the p95 temperature per trailer per hour for the last 90 days” without a batch job. The pattern that works is a time-series data platform: edge devices that buffer through connectivity gaps, an ingestion tier built to absorb bursts without dropping points, a database designed around time as the primary axis, automated downsampling so you keep raw fidelity briefly and rolled-up summaries cheaply for years, and an analytics layer that serves both a live operations dashboard and the data-science team. This article is a reference architecture for that platform — not a sensor-on-a-breadboard demo, but a multi-tenant, governed system that holds from a thousand devices to several million.

The business scenario

The recurring driver is physical assets generating more measurements than anyone can store at full resolution or query fast enough to act on. Our logistics operator is one face of it. A wind-farm operator pulls 200 channels off each turbine — blade pitch, gearbox vibration, nacelle temperature — at sub-second rates, because a vibration signature that drifts over weeks is the early warning of a gearbox failure that costs a quarter-million to fix unplanned. A smart-metering utility reads ten million electricity meters every fifteen minutes and must bill off it, detect theft from it, and forecast demand with it. A hospital biomed team streams waveform telemetry off infusion pumps and ventilators.

The naive fixes fail in predictable ways. “Just put it in Postgres” works in the pilot with a hundred devices and falls over at ten thousand — index bloat, vacuum storms, and queries that table-scan years of rows. “Keep everything raw forever in a data lake” is storage-cheap but query-hostile: every dashboard becomes a Spark job, and your operations team needs answers in milliseconds, not minutes. “Sample less often” destroys the very signal you are collecting for — the vibration anomaly lives in the high-frequency data. “Stream straight to the cloud with no edge buffer” means every cellular dropout, every backhaul blip, is permanent data loss, which for a regulated cold chain is a compliance failure.

A time-series platform threads the needle. At the edge, devices and gateways buffer locally and replay when connectivity returns, so a tunnel or a dead zone delays data rather than destroying it. The ingestion tier is a durable, partitioned log that decouples bursty producers from the database and lets you replay on failure. The database is purpose-built for time: it ingests at line rate, compresses aggressively because adjacent readings are similar, and answers time-windowed aggregate queries in milliseconds. Downsampling and rollups run continuously: raw data lives for days or weeks, one-minute and one-hour summaries live for years, and your storage bill grows sublinearly with your data volume. The analytics layer fans the same data out to a real-time dashboard, an alerting engine, and a lakehouse for the data scientists, each reading the resolution it needs.

The scenario scales cleanly. The small deployment is a thousand devices, tens of thousands of points per second, one region, a single database node. The large one is several million devices, low millions of points per second, multiple regions for data-residency, and a hard tiering policy because nobody can afford to keep petabytes of raw telemetry hot. The shape of the diagram is the same; what changes is partition counts, the size of the time-series cluster, and how aggressively you roll up and tier to cold storage.

Architecture overview

The platform has two flows that share infrastructure but run on different clocks: a hot path (synchronous-ish, sensor-to-dashboard in seconds, drives live operations and alerting) and a warm/cold path (continuous rollups and batch analytics that feed reporting, ML, and long-term retention). Keeping them mentally separate is the first step to operating this well — the hot path optimises for latency and recency, the cold path for cost and completeness.

[Figure: architecture of the time-series IoT data platform]

Hot path, following the data: (1) Each device or local gateway collects readings and writes them to an on-device buffer; an agent batches and ships them over MQTT (or HTTPS where a network only permits web traffic). Telemetry enters through an IoT ingestion gateway, AWS IoT Core or Azure IoT Hub depending on cloud (Google retired its Cloud IoT Core service in 2023, so on GCP a partner or self-managed MQTT broker fills this role), which terminates TLS, authenticates the device by its X.509 client certificate, and enforces per-device throttling. For globally distributed fleets, Akamai (or the cloud’s own edge/CDN) fronts the HTTPS ingestion endpoint to terminate connections close to the device and absorb regional spikes. (2) The gateway publishes every accepted message onto a partitioned streaming log, Apache Kafka (often as Confluent Cloud, AWS MSK, or Azure Event Hubs through its Kafka-compatible endpoint), keyed by device ID so a device’s readings stay ordered. This log is the shock absorber: producers burst, consumers fall behind and catch up, and nothing is lost because the log is durable and replayable.

(3) A stream-processing stage — Apache Flink, Kafka Streams, or Spark Structured Streaming — consumes the log and does three jobs at once: it validates and deduplicates (devices retrying after a buffer flush will resend), it computes real-time aggregates and alert conditions in tumbling/sliding windows (“temperature above band for >10 minutes”), and it writes clean readings onward. (4) Validated readings land in the time-series database — InfluxDB, TimescaleDB (PostgreSQL + the Timescale extension), Amazon Timestream, or Apache Druid for high-cardinality analytical workloads — which is the system of record for the hot and warm tiers. (5) The operations dashboard (Grafana is the near-universal choice) and the alerting engine query the time-series DB for live views and fire notifications — to ServiceNow to open an incident for a failing reefer, to PagerDuty/Opsgenie for the on-call, to the driver’s app for an in-cab warning.

Warm/cold path runs continuously alongside: (6) Downsampling jobs — continuous queries in InfluxDB, continuous aggregates in TimescaleDB, scheduled rollups in the stream processor — read raw data and write rollups (per-minute and per-hour min/max/mean/p95 per device) into separate, longer-retention tables. Retention policies then expire raw data after its short window while the rollups live for years. (7) In parallel, raw readings are archived from Kafka to a lakehouse — object storage (S3 / ADLS / GCS) in Apache Parquet, catalogued as Apache Iceberg or Delta tables — where the data-science team runs Spark/Databricks jobs for failure-prediction models, queries with Trino/Athena, and trains on the full-fidelity history the hot store no longer keeps.
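To make the rollup step concrete, here is a minimal Python sketch of a tumbling-window aggregation that produces min/max/mean/p95 per device per bucket. It is an illustration only: in the platform this work is done by continuous aggregates or the stream processor, the function names and the sample trailer ID are hypothetical, and the percentile uses the exact nearest-rank method rather than the approximate sketch a time-series database would use.

```python
import math
from collections import defaultdict
from statistics import mean


def nearest_rank_p95(values):
    """95th percentile by the nearest-rank method (exact, for illustration)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def rollup(readings, bucket_seconds=60):
    """Group (device_id, epoch_seconds, value) tuples into fixed tumbling buckets."""
    buckets = defaultdict(list)
    for device_id, ts, value in readings:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[(device_id, bucket_start)].append(value)
    return {
        key: {
            "min": min(vals),
            "max": max(vals),
            "mean": mean(vals),
            "p95": nearest_rank_p95(vals),
            "count": len(vals),
        }
        for key, vals in buckets.items()
    }


# Twelve 5-second readings for one hypothetical trailer fall into a single 1-minute bucket.
sample = [("trailer-001", 1_700_000_040 + 5 * i, 2.0 + 0.1 * i) for i in range(12)]
summary = rollup(sample)
print(summary)
```

Passing `bucket_seconds=3600` gives the hourly rollup from the same function; the point is that each rollup row replaces every raw reading that fell inside its bucket.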

The defining property of the whole topology: data is decoupled at every hop and tiered by age. The edge buffer decouples device from network; the Kafka log decouples ingestion from the database; downsampling decouples query cost from raw volume; the lakehouse decouples long-term analytics from the operational store. No single component’s failure or slowdown stalls the others, and the cost of keeping data falls as it ages.

Component breakdown

Component	Representative tooling	Role in the platform	Key configuration choices
Edge buffer / agent	Device firmware ring buffer, EdgeX Foundry, AWS IoT Greengrass, Azure IoT Edge	Collect, batch, persist locally, replay after connectivity gaps	Store-and-forward queue sized to expected outage; at-least-once delivery; local pre-aggregation on constrained links
Ingestion gateway	AWS IoT Core / Azure IoT Hub / MQTT broker on GCP, fronted by Akamai	Device auth, TLS termination, per-device throttle, fan-in	X.509 per-device certs; MQTT QoS 1; rate limit per device; reject on cert revocation
Streaming log	Apache Kafka / Confluent / MSK / Event Hubs	Durable, partitioned, replayable buffer between ingest and storage	Partition by device ID; replication factor 3; retention 24–72 h; tiered storage for cheap long retention
Stream processing	Apache Flink / Kafka Streams / Spark Structured Streaming	Validate, dedupe, windowed aggregation, alert evaluation	Exactly-once sinks; event-time windows with watermarks; idempotent writes keyed on (device, ts)
Time-series database	InfluxDB / TimescaleDB / Amazon Timestream / Druid	System of record for hot + warm; line-rate ingest, fast time-range aggregates	Hypertable/shard by time; columnar compression; tag/dimension design to bound cardinality
Downsampling / rollups	Continuous aggregates, continuous queries, scheduled jobs	Compute and store min/max/mean/p95 at coarser resolution	1-min and 1-h rollups; retention: raw days, rollups years; refresh near real time
Dashboards & alerting	Grafana, native alerting, PagerDuty, ServiceNow	Live operational views; threshold/anomaly alerts to humans and ITSM	Templated per-fleet dashboards; alert on rollups not raw; dedupe + escalation policy
Lakehouse / analytics	S3/ADLS/GCS + Parquet + Iceberg/Delta; Spark/Databricks; Trino/Athena	Full-fidelity long-term store; ML training; ad-hoc analytics	Partition by date+device prefix; compaction; schema evolution via Iceberg
Identity & secrets	Okta / Entra ID (humans), HashiCorp Vault (machines), cloud PKI (devices)	SSO for operators, dynamic credentials for services, cert issuance for devices	SSO + MFA; short-TTL DB creds from Vault; automated device-cert lifecycle
Security & observability	Wiz, CrowdStrike Falcon, Dynatrace/Datadog	Data-posture/CSPM, runtime threat detection, platform telemetry	Posture scan of data stores; runtime agents on brokers/processors; trace the full pipeline

A few choices deserve the why, because they are the ones teams get wrong.

Why a purpose-built time-series database, not a general one. Time-series data has structure a generic store cannot exploit: it is append-mostly, ordered by time, and adjacent readings from the same sensor are nearly identical. Purpose-built engines turn that into enormous wins — they shard by time so old data ages out by dropping whole chunks (no row-by-row deletes, no vacuum storms), and they compress columns of similar values 10–20x with delta and double-delta encoding. TimescaleDB keeps you in PostgreSQL (your team’s existing SQL, joins to relational metadata, the whole Postgres ecosystem) while adding hypertables and continuous aggregates; InfluxDB is leaner and ships with retention policies and downsampling as first-class concepts; Druid excels when queries slice across many dimensions at high cardinality; Timestream is the serverless AWS-native option when you want no nodes to run. The wrong move is forcing raw six-figure-per-second inserts into vanilla Postgres or MySQL and watching index maintenance melt it.

Why cardinality is the silent killer. The metric that decides whether a time-series database stays fast or falls over is series cardinality — the number of unique combinations of measurement plus tags/dimensions. Put a high-cardinality value (a raw GPS coordinate, a per-request UUID, a free-text status) into a tag and you create millions of distinct series; index memory explodes and queries crawl. The discipline: tags are for things you filter and group by with bounded values (device ID, fleet, region, sensor type); high-cardinality or continuously varying values go into fields (which are not indexed) or are bucketed. A logistics fleet of 40,000 trailers with a dozen sensor types is ~480k series — comfortable. The same fleet with raw lat/long as a tag is effectively infinite — a self-inflicted outage.
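The cardinality arithmetic can be sketched in a few lines of Python. This is a toy model, not any database's real index accounting: `series_upper_bound` and `distinct_series` are hypothetical helpers, and the moving trailer is simulated data.

```python
from math import prod


def series_upper_bound(tag_cardinalities):
    """Worst-case series count: the product of distinct values across indexed tags."""
    return prod(tag_cardinalities.values())


def distinct_series(points, tag_keys):
    """Count the series a stream would actually create for a given choice of tags."""
    return len({tuple(point[key] for key in tag_keys) for point in points})


fleet_bound = series_upper_bound({"device_id": 40_000, "sensor": 12})

# One hypothetical trailer reporting temperature while it moves: 1,000 readings.
points = [
    {"device_id": "trailer-001", "sensor": "temp", "lat": round(27.5 + i * 0.0001, 4)}
    for i in range(1_000)
]
as_field = distinct_series(points, ["device_id", "sensor"])       # latitude kept as a field
as_tag = distinct_series(points, ["device_id", "sensor", "lat"])  # latitude used as a tag

print(fleet_bound, as_field, as_tag)
```

With latitude as a field the trailer contributes one temperature series; with latitude as a tag every new position creates another series, so the count grows with every reading instead of staying fixed.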

Why downsampling is not optional at scale. Raw fidelity matters for a short window (for forensic replay of an incident, for training a model on high-frequency signal), but almost nobody queries last March’s data at five-second resolution, and storing it hot to allow that is ruinously expensive. Downsampling keeps raw data for days to weeks, then serves everything older from rollups: at a five-second cadence, per-minute and per-hour summaries hold 12x and 720x fewer rows than the raw data. A year of per-hour rollups for 480k series is gigabytes, not the dozens of terabytes the raw would be. The query layer transparently picks the right resolution for the time range, raw for “last hour” and the hourly rollup for “last quarter”, so dashboards stay fast over any window. Skip this and either your storage bill or your query latency (usually both) eventually forces an emergency re-architecture.
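The row-count arithmetic and the resolution choice can be sketched as follows, assuming the five-second raw cadence used throughout this article. The 2,000-points-per-series budget in `pick_resolution` is an arbitrary assumption for illustration, and all three function names are hypothetical.

```python
def reduction_factor(raw_interval_s, rollup_interval_s):
    """How many raw readings one rollup row replaces."""
    return rollup_interval_s // raw_interval_s


def rows_per_year(series, interval_s):
    """Rows written in a 365-day year at one row per series per interval."""
    return series * (365 * 24 * 3600 // interval_s)


def pick_resolution(range_seconds, max_points=2_000):
    """Choose the finest resolution that keeps points per series within the budget."""
    for name, interval_s in (("raw", 5), ("1m", 60), ("1h", 3600)):
        if range_seconds / interval_s <= max_points:
            return name
    return "1h"  # coarsest available rollup


print(reduction_factor(5, 60), reduction_factor(5, 3600))
print(rows_per_year(480_000, 5), rows_per_year(480_000, 3600))
print(pick_resolution(3600), pick_resolution(24 * 3600), pick_resolution(90 * 24 * 3600))
```

For 480k series the raw tier would write about 3 trillion rows a year against about 4.2 billion hourly rollup rows, which is why the long tail is served from rollups.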

Why the edge buffer earns its keep. Connectivity for mobile and remote assets is unreliable by nature — trucks drive through tunnels, turbines sit on hilltops with flaky backhaul, ships cross oceans. A device that streams with no local buffer loses every reading taken during a dropout, permanently. A store-and-forward buffer (a bounded on-disk queue) holds readings through the outage and replays them when the link returns, turning data loss into data delay. Size the buffer to the worst plausible outage (a multi-hour dead zone), and on truly constrained links, pre-aggregate at the edge — ship one-minute summaries plus exception events rather than every raw point — to cut backhaul cost without losing the signal that matters.
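A store-and-forward buffer can be sketched like this. It is a simplified model under stated assumptions: an in-memory `deque` stands in for the bounded on-disk queue, the `send` callback is a hypothetical uplink that returns `True` on acknowledgement, and the drop-oldest overflow policy is one reasonable choice rather than the only one.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Bounded FIFO: holds readings while offline and replays them oldest-first."""

    def __init__(self, capacity):
        self._queue = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, reading):
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # the deque evicts the oldest entry on overflow
        self._queue.append(reading)

    def flush(self, send):
        """Remove a reading only after send() confirms it (at-least-once delivery)."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # link dropped again; keep the rest for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)


def capacity_for(outage_hours, channels, interval_s):
    """Readings to hold for an outage of the given length."""
    return outage_hours * 3600 // interval_s * channels


buffer = StoreAndForwardBuffer(capacity=3)
for reading in range(5):          # five readings arrive while the link is down
    buffer.append(reading)

delivered = []
def flaky_send(reading):          # hypothetical uplink that fails after two sends
    if len(delivered) >= 2:
        return False
    delivered.append(reading)
    return True

sent_count = buffer.flush(flaky_send)
print(sent_count, delivered, len(buffer), buffer.dropped, capacity_for(8, 12, 5))
```

Because a reading is removed only after it is acknowledged, a crash between send and acknowledgement produces a duplicate rather than a loss, which is why the downstream writes need to be idempotent.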

Implementation guidance

Provision with IaC, and treat ingestion ordering as a contract. Use Terraform — providers exist for every cloud’s IoT service, for Confluent Cloud and MSK, for the managed databases, and for Grafana Cloud — so the whole platform is reproducible and reviewable. The deployment order matters: the streaming log and its topic/partition layout come first, because everything downstream is a consumer of it, and re-partitioning a live Kafka topic is painful. Key by device ID from day one so a device’s readings stay strictly ordered within a partition; that ordering is what lets the stream processor dedupe and window correctly.
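The reason the partition layout has to be right up front can be shown with a toy key-hash partitioner. This sketch uses MD5 purely as a stable stand-in hash (Kafka's default partitioner uses a different hash function), and `partition_for` and the device IDs are hypothetical.

```python
import hashlib


def partition_for(device_id, num_partitions):
    """Map a device ID to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


devices = [f"trailer-{n:05d}" for n in range(1_000)]

# The same key always lands on the same partition, so one device's readings stay ordered.
stable = all(partition_for(d, 64) == partition_for(d, 64) for d in devices)

# Growing the topic from 64 to 96 partitions changes where many keys hash to.
moved = sum(1 for d in devices if partition_for(d, 64) != partition_for(d, 96))

print(stable, moved, len(devices))
```

Any device whose partition changes has its newer readings on a different partition from its older ones, so per-device ordering no longer holds across the change; that is the cost of growing a live topic.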

Design the schema around time and bounded tags. In TimescaleDB this is a hypertable plus continuous aggregates; the rollup is the cost-control mechanism, so it ships with the schema, not as an afterthought:

-- approx_percentile / percentile_agg are provided by the TimescaleDB Toolkit extension
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

-- Raw readings: a hypertable chunked by time, compressed after a day
CREATE TABLE reading (
  ts          TIMESTAMPTZ NOT NULL,
  device_id   TEXT        NOT NULL,   -- bounded: ~40k values -> safe as a dimension
  sensor      TEXT        NOT NULL,   -- bounded: temp, humidity, door, fuel...
  value       DOUBLE PRECISION
);
SELECT create_hypertable('reading', 'ts', chunk_time_interval => INTERVAL '6 hours');
ALTER TABLE reading SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id,sensor');
SELECT add_compression_policy('reading', INTERVAL '1 day');

-- Hourly rollup: continuous aggregate, refreshed near real time
CREATE MATERIALIZED VIEW reading_1h
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket, device_id, sensor,
       min(value) AS min_value, max(value) AS max_value, avg(value) AS avg_value,
       approx_percentile(0.95, percentile_agg(value)) AS p95
FROM reading GROUP BY 1, 2, 3;

-- Refresh policy (intervals are illustrative): the refresh window must stay well
-- inside raw retention, so expiring raw chunks never removes rollup rows
SELECT add_continuous_aggregate_policy('reading_1h',
  start_offset      => INTERVAL '3 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '15 minutes');

-- Keep raw 14 days; keep the cheap rollup for 3 years
SELECT add_retention_policy('reading', INTERVAL '14 days');
SELECT add_retention_policy('reading_1h', INTERVAL '3 years');

Make writes idempotent. Devices replay buffered data after an outage and stream processors may reprocess on recovery, so the same reading can arrive twice. Key every write on (device_id, sensor, ts) and use an idempotent sink (upsert, or exactly-once Kafka→DB delivery in Flink) so duplicates collapse instead of doubling your data. Validation in the stream processor should also drop readings with timestamps wildly in the future or past — a device with a wrong clock will otherwise poison your time-ordering.
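A minimal sketch of the idea, with an in-memory dictionary standing in for the database: the `IdempotentSink` class is hypothetical, and the five-minute future and thirty-day past tolerances are assumed values to tune against your own buffer sizes and clock quality.

```python
class IdempotentSink:
    """In-memory stand-in for an upserting sink keyed on (device_id, sensor, ts)."""

    def __init__(self, max_future_s=300, max_past_s=30 * 86_400):
        self.rows = {}
        self.max_future_s = max_future_s
        self.max_past_s = max_past_s

    def write(self, device_id, sensor, ts, value, now):
        # Reject readings whose device clock is implausibly far from the server clock.
        if ts > now + self.max_future_s or ts < now - self.max_past_s:
            return "rejected"
        key = (device_id, sensor, ts)
        status = "duplicate" if key in self.rows else "inserted"
        self.rows[key] = value  # upsert: a replayed reading overwrites itself
        return status


now = 1_700_000_000
sink = IdempotentSink()
results = [
    sink.write("trailer-001", "temp", now - 10, 3.9, now),      # first delivery
    sink.write("trailer-001", "temp", now - 10, 3.9, now),      # replay after an outage
    sink.write("trailer-001", "temp", now + 86_400, 3.9, now),  # clock a day ahead
]
print(results, len(sink.rows))
```

In a real database the same effect comes from a unique key on those three columns plus an upsert; the past tolerance has to be longer than the largest edge buffer, or legitimate replays will be rejected.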

Wire alerting off rollups, not raw. Evaluate alert conditions in the stream processor’s windows where you can (sub-second, before the data even lands), and for dashboard-driven alerts query the one-minute rollup rather than scanning raw points — it is both faster and less noisy. Route machine-actionable alerts to ServiceNow to open and track an incident with an audit trail, and human-actionable ones to PagerDuty/Opsgenie with a sane escalation and dedupe policy so a fleet-wide event does not page the on-call forty thousand times.
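The “out of band for more than 10 minutes” rule can be sketched as a small per-device state machine. This is an illustration under simplifying assumptions: readings arrive in timestamp order, the 2 to 8 degree band is an invented example, and a real stream processor would key this state by device and handle late data with event-time watermarks.

```python
class ExcursionDetector:
    """Fires once per excursion when a value stays out of band longer than the limit."""

    def __init__(self, low, high, max_duration_s=600):
        self.low = low
        self.high = high
        self.max_duration_s = max_duration_s
        self._since = None      # timestamp of the first out-of-band reading
        self._alerted = False   # suppress repeat alerts within one excursion

    def observe(self, ts, temperature):
        if self.low <= temperature <= self.high:
            self._since = None
            self._alerted = False
            return False
        if self._since is None:
            self._since = ts
        if not self._alerted and ts - self._since > self.max_duration_s:
            self._alerted = True
            return True
        return False


detector = ExcursionDetector(low=2.0, high=8.0)
# A hypothetical reefer reads 9.5 degrees every 5 seconds for 700 seconds.
alerts = [ts for ts in range(0, 701, 5) if detector.observe(ts, 9.5)]
print(alerts)
```

Firing once per excursion, rather than on every out-of-band reading, is the same dedupe principle the escalation policy applies at fleet scale.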

Enterprise considerations

Security & identity. Three distinct identity problems, three tools. Devices authenticate with per-device X.509 certificates issued and rotated by the cloud PKI (IoT Core / IoT Hub provisioning) — a compromised trailer’s cert is revoked without touching the rest of the fleet, and a shared symmetric key across the fleet is the anti-pattern that turns one stolen device into a fleet breach. Human operators sign in to Grafana, ServiceNow, and the cloud consoles through Okta or Entra ID SSO with MFA, so access is centrally governed and instantly revocable on offboarding. Services — the stream processors and rollup jobs reaching the database — pull short-TTL credentials from HashiCorp Vault rather than carrying static passwords, so a leaked credential expires in minutes and rotation is automatic. (This platform’s own history is the cautionary tale: long-lived database passwords once sat in source control; dynamic Vault-issued creds are precisely the fix.) On top of that, Wiz continuously scans the data stores and lakehouse buckets for posture drift — a publicly exposed S3 archive of telemetry, an over-broad IAM role, an unencrypted snapshot — and CrowdStrike Falcon runs on the broker and processing nodes for runtime threat detection. Encrypt in transit (TLS device-to-gateway and service-to-service) and at rest (cloud KMS), and tag data by tenant/region so residency rules are enforceable.

Cost optimization. Storage and data movement dominate, and both grow with the fleet, so engineer for them from day one. (1) Tier by age, the biggest lever: raw hot for days, rollups warm for years, Parquet in object storage for cold archive at a tiny fraction of the per-GB cost; Kafka tiered storage does the same for the log so you keep long retention without paying for hot broker disk. (2) Downsample aggressively: every channel you keep at full resolution past its useful window is pure waste; tune raw retention to your actual forensic and training needs (often 7–30 days) and let rollups carry the long tail. (3) Pre-aggregate at the edge to cut both ingestion and egress on metered cellular links: ship summaries plus exceptions, not every raw point, where the use case allows. (4) Compress, which is close to free: segment by the right columns so similar values pack tightly (10–20x is normal for sensor data). (5) Right-size the time-series cluster to steady write load, not peak, and absorb bursts in Kafka rather than in over-provisioned database nodes.

Scalability. Each tier scales independently. The gateway scales horizontally with the fleet (cloud IoT services are effectively elastic). Kafka scales by adding partitions and brokers; partition count is your ingest-parallelism ceiling, so provision headroom up front, because adding partitions to a live topic changes which partition a key hashes to and breaks per-device ordering across the change. Stream processing (Flink/Spark) scales by parallelism, naturally bounded by partition count. The time-series database scales by sharding across time and (in clustered editions) across nodes; the real scaling discipline is cardinality control: bound your series count and the cluster scales roughly linearly, let cardinality explode and no amount of hardware saves you. The lakehouse scales effectively without limit on object storage, with compute (Spark/Trino) decoupled and spun up per job.

Reliability & DR (RTO/RPO). Decide the numbers per tier and exploit the natural buffers. The edge buffer plus the Kafka log already give you a strong RPO on the hot path: if the database is down, ingestion keeps flowing into the log (and into device buffers), and you replay on recovery with zero data loss as long as the outage is shorter than Kafka retention. Run Kafka with replication factor 3 across availability zones. The time-series database replicates to a standby (streaming replication for TimescaleDB/Postgres; multi-node or cross-region for the others) for fast failover. The lakehouse on geo-redundant object storage is your durable source of truth — full-fidelity raw data is recoverable from it even if the operational database is lost, which makes the hot store rebuildable rather than precious. A pragmatic enterprise target: RTO 15 minutes (failover the database, redirect consumers) and RPO near zero on the hot path because Kafka and the edge buffer cover the gap, with the analytical store reconstructable from object storage within hours. Test it with a game day: kill the primary database, confirm Kafka backs up without loss, fail over, and confirm consumers replay cleanly.

Observability. Instrument the pipeline end to end with Dynatrace or Datadog: trace a reading from gateway through Kafka through the stream processor into the database, with consumer lag as the headline metric — rising lag is the leading indicator that your processing or database write path cannot keep up, the canary before data goes stale. Emit the metrics the business actually feels: ingestion rate and drop/reject rate at the gateway, end-to-end latency (sensor timestamp to query-visible), series cardinality (watch it like a hawk; alert on growth), rollup freshness (is the continuous aggregate caught up), and storage per tier and per tenant for chargeback. Grafana serves the operational fleet views; the platform-health views belong in your APM so the team sees the pipeline itself, not just the cargo.
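Consumer lag itself is simple arithmetic, sketched here with plain dictionaries in place of a Kafka admin client. The function names are hypothetical, and the three-sample rule for “rising” is an assumed alerting heuristic rather than a standard.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the consumer group's committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


def lag_is_rising(total_lag_samples, min_points=3):
    """True when total lag increased across each of the most recent samples."""
    recent = total_lag_samples[-min_points:]
    return len(recent) == min_points and all(b > a for a, b in zip(recent, recent[1:]))


lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 250})
print(lag, sum(lag.values()))
print(lag_is_rising([10, 40, 90]), lag_is_rising([10, 40, 30]))
```

A single high lag value after a burst is normal; the signal worth paging on is lag that keeps growing, because it means the write path is not keeping up with steady-state ingest.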

Governance & data quality. Telemetry is messy — clock skew, stuck sensors reporting a flatline, out-of-range spikes, gaps. Validate at the stream-processing stage (range checks, rate-of-change checks, flag-don’t-drop where a human should review) so downstream consumers trust the data. Catalog the lakehouse tables (Iceberg/Glue/Unity Catalog) with schema and lineage so data scientists know what each channel means and when its semantics changed. Define retention and residency as policy — raw data lives N days, rolled-up data M years, EU device data stays in EU regions — and enforce it through automated retention jobs and Wiz posture rules rather than tribal knowledge. For regulated cold chains, keep the rollups and exception events for the full audit-retention period and make them queryable by shipment, because the auditor’s question is “prove this load stayed in band,” and the answer must be one query away.
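The range, rate-of-change, and stuck-sensor checks can be sketched as a function that flags readings instead of dropping them. All thresholds here are invented for the example, and `quality_flags` is a hypothetical helper that assumes one sensor's readings in time order.

```python
def quality_flags(values, low, high, max_step, flatline_run):
    """Return one set of flags per reading; readings are flagged, never dropped."""
    flags = []
    run = 1  # length of the current run of identical consecutive values
    for i, value in enumerate(values):
        current = set()
        if not (low <= value <= high):
            current.add("out_of_range")
        if i > 0:
            if abs(value - values[i - 1]) > max_step:
                current.add("rate_of_change")
            run = run + 1 if value == values[i - 1] else 1
        if run >= flatline_run:
            current.add("flatline")
        flags.append(current)
    return flags


temps = [4.0, 4.1, 4.1, 4.1, 4.1, 30.0, 4.2]
flags = quality_flags(temps, low=-30.0, high=25.0, max_step=5.0, flatline_run=4)
for value, flag_set in zip(temps, flags):
    print(value, sorted(flag_set))
```

Keeping the flagged readings lets a reviewer decide whether the 30.0 spike was a sensor glitch or a door left open, which matters when the same data is audit evidence.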

Reference enterprise example

Northwind Cold Chain, a fictional North American refrigerated-logistics operator (~40,000 reefer trailers, ~3,500 employees), built this platform to satisfy FDA/FSMA cold-chain audit requirements and to cut spoilage claims. Their telemetry: a dozen channels per trailer at a 5-second cadence (temperature, humidity, door state, GPS, fuel, compressor health), about 100,000 data points per second sustained (40,000 × 12 ÷ 5 = 96,000), with frequent multi-hour connectivity gaps as trucks cross remote stretches.

Decisions they made. Every trailer’s gateway ran a store-and-forward buffer sized to eight hours of outage, shipping over MQTT QoS 1 to AWS IoT Core fronted by Akamai for connections in border and port regions. Messages landed on MSK (Kafka), 64 partitions keyed by trailer ID, replication factor 3, 48-hour retention with tiered storage. Flink validated, deduplicated the inevitable post-outage replays on (trailer, sensor, ts), and evaluated the core compliance rule — temperature outside the SLA band for more than 10 minutes — firing an alert into ServiceNow to open a tracked incident and into the dispatcher’s console. Readings landed in TimescaleDB (a 3-node cluster) as a hypertable, compressed after a day, with continuous aggregates producing 1-minute and 1-hour rollups. Retention: raw for 14 days, rollups for 3 years (their audit-retention requirement). Raw data was archived from Kafka to S3 as Parquet / Iceberg, where the data-science team trained a compressor-failure model on full-fidelity vibration and current-draw history. Operators used Grafana SSO’d through Okta; the Flink and rollup jobs drew short-TTL Postgres credentials from HashiCorp Vault; Wiz watched the S3 archive and IAM posture; CrowdStrike Falcon ran on the MSK and Flink nodes; Dynatrace traced the pipeline with consumer lag on the main dashboard.

The numbers. ~100k points/sec sustained, ~8.6 billion readings a day. Sensor-to-dashboard latency p95 ~4 seconds. Raw storage stayed near 2.5 TB hot (14-day window, compressed ~14x); three years of rollups added only single-digit terabytes; the cold Parquet archive cost pennies per GB and was queried only by batch jobs. Monthly run cost landed near ₹38 lakh (~$45,500): MSK + tiered storage ~$11,000, the TimescaleDB cluster ~$9,000, IoT Core + Akamai ingestion ~$8,000, Flink compute ~$6,000, S3/lakehouse + Spark ~$5,000, Grafana/Dynatrace/Vault/Wiz/Falcon the remainder. Downsampling plus tiering was the difference between this budget and one several times larger: keeping three years of raw 5-second data hot would have cost more in storage alone than the entire platform.
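The headline figures are easy to sanity-check. This short Python calculation uses only the numbers quoted in this example; the variable names are just labels for them.

```python
trailers, channels, cadence_s = 40_000, 12, 5
fleet_points_per_second = trailers * channels // cadence_s  # steady-state rate

quoted_points_per_second = 100_000
readings_per_day = quoted_points_per_second * 86_400

monthly_usd = {
    "MSK + tiered storage": 11_000,
    "TimescaleDB cluster": 9_000,
    "IoT Core + Akamai ingestion": 8_000,
    "Flink compute": 6_000,
    "S3/lakehouse + Spark": 5_000,
}
quoted_total_usd = 45_500
remainder_usd = quoted_total_usd - sum(monthly_usd.values())

print(fleet_points_per_second, readings_per_day, remainder_usd)
```

The fleet arithmetic gives 96,000 points per second, which rounds to the quoted ~100k; the itemised services sum to $39,000, leaving about $6,500 a month for the dashboard, APM, secrets, and security tooling.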

The outcome. Audit evidence that used to take a data analyst two days to assemble from raw logs became a single per-shipment query returning the temperature envelope and any excursion events with timestamps — and the FDA auditor accepted it. Spoilage claims fell ~30% because the 10-minute-out-of-band alert reached a dispatcher in time to reroute or service a failing reefer instead of discovering a ruined load on delivery. The compressor-failure model, trained on the lakehouse history, began flagging units ~2 weeks before failure, converting unplanned roadside breakdowns into scheduled maintenance. In a DR game day they killed the primary TimescaleDB node; Kafka and the trailer buffers absorbed the gap with zero data loss, the standby took over in ~11 minutes, and consumers replayed cleanly from the log.

When to use it

Use this architecture when you have a fleet of physical assets emitting measurements faster than a general database can absorb or query; you need both a live operational view (seconds-fresh dashboards and alerts) and long-term analytics on the same data; your devices live on unreliable links and cannot afford to lose data during outages; and storage cost forces you to keep raw fidelity briefly but summaries for years. That covers most industrial-IoT and observability-at-scale demand — connected vehicles and cold chain, wind and solar fleets, smart metering, manufacturing-line sensors, building management, and medical-device telemetry.

Trade-offs to accept. This platform has real moving parts — an edge agent to maintain, a Kafka cluster to operate, stream-processing jobs, a time-series database to tune, and a lakehouse to govern. Downsampling means old data is lossy by design: you keep summaries, not every raw point, so a forensic question that needs five-second resolution from last year cannot be answered unless you archived raw to the lakehouse (which is why you do). End-to-end latency is the sum of edge batching, log, and processing — typically seconds, which is right for operations but not for hard-real-time control loops (those belong on the device or a local PLC, not a cloud round-trip).

Anti-patterns. (1) High-cardinality tags — raw GPS or UUIDs as indexed dimensions will detonate your time-series database; bound your series count. (2) No edge buffer — every connectivity gap becomes permanent data loss, fatal for a regulated cold chain. (3) Keeping raw data hot forever — storage or query cost eventually forces an emergency rebuild; downsample and tier from day one. (4) A general-purpose database at the hot tier — vanilla Postgres/MySQL melts under sustained six-figure inserts; use a purpose-built engine. (5) Non-idempotent writes — buffered replays and reprocessing double your data; key on (device, sensor, ts). (6) Alerting on raw scans — slow and noisy; alert in the stream or off rollups. (7) Static service credentials — issue short-TTL creds from Vault, not passwords in config.

Alternatives, and when they win. If your data rate is modest — a few thousand points per second, a few hundred devices — TimescaleDB on a single managed Postgres with no Kafka and no Flink is dramatically simpler and entirely sufficient; add the streaming tier only when ingest bursts or producer/consumer decoupling demand it. If you are all-in on one cloud and want zero infrastructure to run, the serverless managed services (Amazon Timestream, Kinesis, managed Flink; or the Azure/GCP equivalents) trade some control and portability for operational simplicity. If your real need is application and infrastructure metrics rather than physical-sensor telemetry, a purpose-built observability stack (Prometheus + Thanos/Mimir, or a SaaS APM) is the better-fit tool. And if you need sub-millisecond closed-loop control, that logic belongs at the edge on the device or a local controller — a cloud time-series platform is for monitoring, analytics, and audit, not for actuating a control loop in real time. The architecture here is the destination for industrial-scale telemetry; start simpler and graduate into it as volume, reliability, and governance demand.


Written by Vinod
