A global logistics company — think parcels, cross-border freight, and a customer-facing tracking page that 30 million shippers refresh during the holiday peak — has grown by acquisition into a three-cloud mess. The core shipment-routing platform runs on AWS. A regional acquisition brought a billing and customs stack on Azure. The data-science team that predicts delivery ETAs lives on GCP because that is where their BigQuery warehouse already was. Each cloud has its own native monitoring — CloudWatch, Azure Monitor, Cloud Monitoring — and each team built dashboards in the tool nearest to hand. The result is the failure mode every VP of Engineering eventually hits: during the last peak, a customs-clearance slowdown on Azure cascaded into stalled routing on AWS, and it took 51 minutes to correlate the two because nobody could see both clouds on one screen. The mandate from the new SRE leadership is blunt: one place to see everything, one place to be paged, and an honest number for “are we meeting our promise to customers.” This article is the reference architecture for making Datadog that single pane of glass across AWS, Azure, GCP, and Kubernetes — and, just as importantly, for wiring it into the identity, security, and incident-management fabric the rest of the company already runs.
The pressures here are operational rather than regulatory, but they are no less unforgiving. Heterogeneity means three different metric models, three log formats, and three tagging conventions that must be reconciled into one vocabulary or the dashboards are worthless. Scale means tens of thousands of containers across dozens of Kubernetes clusters emitting metrics every fifteen seconds, plus high-cardinality logs from a tracking API that spikes 8× at peak. Cost means observability spend that, left ungoverned, can quietly rival the compute bill it is meant to watch. And time-to-detect means the difference between a customer noticing a stuck parcel and an on-call engineer noticing first. A single-pane strategy satisfies all four only if it is built deliberately — not by clicking “enable integration” on three clouds and hoping the dashboards assemble themselves.
Why not the obvious shortcuts
Three tempting non-answers will be proposed in the first planning meeting, and each fails in a way worth naming.
“Just use each cloud’s native monitoring and put links in a wiki.” This is the status quo that produced the 51-minute incident. Native tools are excellent inside their own cloud and blind outside it; there is no shared trace that follows a request from the AWS routing service into the Azure billing call, and no single alert that fires on the combination. Correlation becomes a human tab-switching exercise during the exact moment humans are worst at it.
“Aggregate everything into a self-hosted stack — Prometheus plus Grafana plus Loki plus Tempo — and own it.” This is a legitimate architecture and cheaper in license terms, but it trades a license bill for an operations bill: you now run, scale, and shard a metrics store, a log store, and a trace store, federate Prometheus across clouds, and staff the team that keeps that pipeline alive at 3 a.m. For a company whose core competency is moving parcels, not running a time-series database, that is the wrong place to spend headcount.
“Pick one cloud’s tool and force the other two into it.” Azure Monitor will technically ingest AWS metrics through connectors, but it treats them as second-class, the tagging never lines up, and you have hitched your cross-cloud observability to a vendor with a structural incentive to make the other clouds look harder to operate. A neutral plane that treats all three as first-class is the point.
Datadog threads the needle by being cloud-agnostic by design: it pulls metrics and logs from each cloud’s own APIs through first-party integrations, accepts open-standard telemetry (OpenTelemetry, StatsD, Prometheus scrape) from anything those integrations miss, and unifies it all under one tag model, one query language, one alerting engine, and one incident workflow. The clouds stay where they are; Datadog supplies the shared lens.
Architecture overview
The platform has three logical layers that are useful to hold separately: a collection layer that gets telemetry out of each cloud, a Datadog platform layer that stores, correlates, and visualizes it, and an action layer that turns signal into a human being doing something. Telemetry flows up; alerts and identity flow down.
The defining design choice is the one that makes the whole thing coherent: a single, enforced tag taxonomy applied at the source across all three clouds. Every metric, log, and span carries env, service, cloud, region, team, and business_unit — the same keys, the same values, everywhere. Without this, a “single pane” is three panes sharing a URL. This taxonomy is not a Datadog feature you toggle; it is a contract enforced in your infrastructure-as-code, which is why Terraform appears so early below.
Collection layer, following the data flow per cloud:
- AWS. The AWS integration assumes a cross-account IAM role and pulls CloudWatch metrics, plus inventory and tags, for hundreds of services without an agent. For richer signal, CloudWatch logs and events stream through a Datadog Forwarder Lambda (or Kinesis Firehose for volume), and EC2/ECS hosts run the Datadog Agent for process-level metrics and live tracing.
- Azure. The Azure integration registers an Entra ID app (service principal) with `Monitoring Reader` across the relevant subscriptions and pulls Azure Monitor metrics and Activity Logs. AKS nodes and VMs run the Agent; Azure platform logs route via Event Hub to Datadog.
- GCP. The GCP integration uses a service account with `Monitoring Viewer` / `Compute Viewer` to pull Cloud Monitoring metrics; logs ship through a Pub/Sub → Dataflow push to Datadog. GKE nodes run the Agent.
- Kubernetes (all clouds). Each cluster runs the Datadog Agent as a DaemonSet plus the Cluster Agent, deployed by Helm. The Agents auto-discover pods, scrape Prometheus-annotated endpoints, and tail container logs, while the Cluster Agent’s admission controller injects the APM tracing library into application pods, so a distributed trace follows a tracking request from the GKE front end through the EKS routing service into the AKS billing call as one waterfall.
- Anything else — legacy virtual appliances, on-prem load balancers, network gear — emits via OpenTelemetry to the Datadog OTLP endpoint or the OpenTelemetry Collector with the Datadog exporter, so nothing is structurally excluded from the pane.
Platform layer. Datadog ingests all of the above, indexes metrics and (selectively) logs, stitches traces, and — critically — applies tag-based correlation so a single dashboard widget can group “p99 latency by service across all clouds” without caring where the data physically originated. Service Level Objectives are defined on top of these signals, monitors evaluate continuously, and Synthetics run scripted checks against customer-facing endpoints from outside every cloud.
Action layer. When a monitor breaches, Datadog routes through PagerDuty for human paging (on-call schedules, escalation policies) and opens or updates a record in ServiceNow for the auditable incident trail and change correlation. Identity into Datadog itself is brokered by Okta (or Entra ID) via SAML/SCIM, so access and team mapping match the rest of the org.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| AWS collection | Datadog AWS integration + Forwarder Lambda + Agent | Pull CloudWatch metrics/inventory; stream logs/events; host & APM metrics | Cross-account IAM role; namespace allow-list to control cost; ECS/EKS Agent |
| Azure collection | Datadog Azure integration + Entra app + Agent | Pull Azure Monitor metrics + Activity Log; AKS/VM host metrics | Monitoring Reader SP; Event Hub log route; per-subscription scoping |
| GCP collection | Datadog GCP integration + Pub/Sub log push + Agent | Pull Cloud Monitoring metrics; ship logs; GKE host metrics | Monitoring Viewer SA; Dataflow log sink; label-to-tag mapping |
| Kubernetes | Datadog Agent DaemonSet + Cluster Agent (Helm) | Auto-discovery, log tailing, Prometheus scrape, APM tracing | Cluster Agent for scale; admission controller for trace lib injection |
| Open telemetry | OpenTelemetry Collector / OTLP, StatsD | Catch-all for appliances, legacy, custom app metrics | Datadog exporter; unified tag processor; tail-based sampling |
| Storage & correlation | Datadog platform (Metrics, Logs, APM) | Unified store, one query language, tag-based correlation | Logging-without-Limits (ingest vs. index split); retention tiers |
| Dashboards | Datadog Dashboards + Watchdog | Single-pane views; automated anomaly surfacing | Template variables on env/cloud/service; SLO widgets |
| Reliability targets | Datadog SLOs + Monitors | Honest customer-promise metric; burn-rate alerting | Metric- and monitor-based SLOs; multi-window burn-rate monitors |
| Black-box checks | Datadog Synthetics | External validation of customer endpoints from outside every cloud | Multistep API + browser tests; private locations for internal apps |
| Human paging | PagerDuty | On-call schedules, escalation, acknowledgement | Severity-mapped routing; event rules; bidirectional ack sync |
| ITSM record | ServiceNow | Auditable incident/change trail, post-incident review | Auto-create Incident on SEV; link to change records |
| Identity / SSO | Okta + Entra ID | SSO and team provisioning into Datadog | SAML login; SCIM team sync; group-to-role mapping |
| Secrets | HashiCorp Vault | Hold cloud integration creds, API/app keys for IaC | Dynamic leases; Agent secret backend; no keys in Helm values |
| Posture & runtime | Wiz + CrowdStrike Falcon | Cloud posture findings and runtime detections into Datadog | Wiz → Datadog event stream; Falcon detections as correlated signal |
| CI / IaC | GitHub Actions / Jenkins + Terraform / Ansible | Manage integrations, monitors, SLOs, dashboards as code | Datadog provider; CI deploy-tracking events; monitor-as-code review |
A handful of these choices carry the weight of the design, and they are the ones teams get wrong.
Why agentless integrations and the Agent, not one or the other. The cloud integrations are agentless and effortless (assume a role, get hundreds of services’ metrics), but they read from CloudWatch/Azure Monitor/Cloud Monitoring, which means you inherit those platforms’ 1–5 minute granularity and you pay per CloudWatch API call. The Datadog Agent running on hosts gives 15-second granularity, process-level detail, live process and network maps, and APM tracing that the cloud APIs cannot provide. The right answer is both: integrations for breadth across every managed service, Agents for depth on the workloads you actually operate. Choosing only Agents blinds you to managed services; choosing only integrations leaves you with coarse data and a growing CloudWatch API bill.
Why the tag taxonomy is enforced in IaC, not requested in a wiki. Each cloud names things differently — AWS tags, Azure tags, GCP labels — and humans tag inconsistently under deadline. If service is routing-svc on AWS and RoutingService on Azure, no dashboard groups them. The fix is to apply Unified Service Tagging (env, service, version) plus your business tags at provisioning time in Terraform, and to use Datadog’s tag-remapping in each integration to normalize the inevitable drift. The reconciliation table below is the kind of mapping you maintain deliberately:
| Concept | AWS | Azure | GCP | Normalized Datadog tag |
|---|---|---|---|---|
| Environment | Environment tag | env tag | env label | env |
| Service | Service tag | app tag | service label | service |
| Region | AWS region | Azure location | GCP region | region |
| Cost owner | CostCenter tag | costCenter tag | cost-center label | business_unit |
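The reconciliation table can be read as a lookup from each cloud’s native key to the normalized tag. The following is a minimal Python sketch of that lookup, not Datadog’s own tag-remapping feature. The names `KEY_MAP` and `normalize_tags`, the sample regions, and the choice to lowercase values are assumptions made for illustration.

```python
"""Sketch: normalize per-cloud tag keys into one tag vocabulary."""

# Per-cloud source key -> normalized tag key (mirrors the reconciliation table).
KEY_MAP = {
    "aws": {"Environment": "env", "Service": "service", "CostCenter": "business_unit"},
    "azure": {"env": "env", "app": "service", "costCenter": "business_unit"},
    "gcp": {"env": "env", "service": "service", "cost-center": "business_unit"},
}


def normalize_tags(cloud: str, raw_tags: dict[str, str], region: str) -> dict[str, str]:
    """Return normalized tags for one resource; unmapped source keys are dropped."""
    mapping = KEY_MAP[cloud]
    normalized = {"cloud": cloud, "region": region}
    for source_key, value in raw_tags.items():
        target_key = mapping.get(source_key)
        if target_key is not None:
            # Lowercasing values is an assumed convention of this sketch.
            normalized[target_key] = value.strip().lower()
    return normalized


aws = normalize_tags("aws", {"Environment": "Prod", "Service": "routing-svc"}, "us-east-1")
azure = normalize_tags("azure", {"env": "prod", "app": "routing-svc", "owner": "x"}, "westeurope")
```

Both results carry the same `env` and `service` values, which is what lets one dashboard group them. Lowercasing only repairs case drift; a value such as RoutingService versus routing-svc still has to be fixed at the source.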
Why SLOs, not just CPU dashboards. The leadership question is “are we meeting our promise to customers,” and CPU graphs do not answer it. Define the Service Level Objective on the indicator the customer feels — for the tracking API, “99.5% of requests succeed and return in under 800 ms over 30 days” — and let Datadog compute the error budget and burn rate. This is what makes a single pane useful rather than merely unified: a row of SLO status widgets tells the VP at a glance whether the company is keeping its word, across all three clouds, on one screen.
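To make the error budget and burn rate concrete, here is a small Python sketch of the arithmetic, assuming a request-based indicator where a request counts as bad if it fails or takes longer than 800 ms. The function names are hypothetical, and the 14.4 threshold is borrowed from the commonly cited multi-window convention (2% of a 30-day budget consumed in one hour), not from this architecture or from Datadog defaults.

```python
"""Sketch: error budget and burn rate for a request-based SLO."""


def error_budget(target_pct: float) -> float:
    """Fraction of requests allowed to be bad, e.g. 99.5 -> about 0.005."""
    return 1.0 - target_pct / 100.0


def burn_rate(bad: int, total: int, target_pct: float) -> float:
    """Observed bad-request ratio divided by the allowed ratio.

    A value of 1.0 means the budget runs out exactly at the end of the SLO window.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(target_pct)


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only when both windows exceed the threshold."""
    return long_window > threshold and short_window > threshold


# 100 bad requests out of 1000 against a 99.5% target burns roughly 20x too fast.
rate = burn_rate(bad=100, total=1000, target_pct=99.5)
```

At a burn rate near 20, a 30-day budget would be gone in about a day and a half, which is the kind of signal worth a page. Requiring both a long and a short window to exceed the threshold keeps a brief blip from paging anyone.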
Implementation guidance
Provision the integrations and the monitoring estate with Terraform — treat them as code, not console clicks. Observability config that lives only in a UI rots, drifts, and cannot be reviewed. The Datadog Terraform provider manages integrations, monitors, SLOs, and dashboards alongside the infrastructure they watch.
A minimal shape for the AWS integration and a multi-cloud SLO communicates the intent:
    resource "datadog_integration_aws_account" "logistics_prod" {
      aws_account_id = "210987654321"
      aws_partition  = "aws"
      auth_config {
        aws_auth_config_role {
          role_name = "DatadogIntegrationRole" # cross-account, read-only
        }
      }
      metrics_config {
        namespace_filters {
          # allow-list to control CloudWatch cost
          include_only = ["AWS/ECS", "AWS/ApplicationELB", "AWS/RDS", "AWS/Lambda"]
        }
      }
      resources_config {
        cloud_security_posture_management_collection = false
      }
      # Other blocks (aws_regions, logs_config, traces_config) are omitted here;
      # check the provider docs for which ones your provider version requires.
    }

    resource "datadog_service_level_objective" "tracking_api" {
      name        = "Tracking API availability"
      type        = "metric"
      description = "99.5% success over 30d across all clouds"
      query {
        numerator   = "sum:trace.http.request.hits{service:tracking-api,!http.status_class:5xx}.as_count()"
        denominator = "sum:trace.http.request.hits{service:tracking-api}.as_count()"
      }
      # One attribute per line: HCL single-line blocks cannot hold several attributes.
      thresholds {
        timeframe = "30d"
        target    = 99.5
        warning   = 99.7
      }
      tags = ["business_unit:tracking", "env:prod"]
    }
Note the SLO query does not mention a cloud at all: it filters on service, and because every cloud stamps service:tracking-api identically, the objective spans AWS, Azure, and GCP transparently. That is the single pane paying off in one resource block.
The collection pipeline, cloud by cloud, with the gotchas.
- AWS: create the cross-account `DatadogIntegrationRole` with a read-only policy, and immediately set a `namespace_filters` allow-list. The single most common cost surprise on this architecture is leaving every CloudWatch namespace enabled and paying for API calls on services you never look at. Stream high-value logs through the Forwarder Lambda; do not index everything.
- Azure: register the Entra ID app, grant `Monitoring Reader` at the management-group level so new subscriptions are covered automatically, and route diagnostic logs through Event Hub. Scope tightly: an over-broad SP is a finding Wiz will (correctly) raise.
- GCP: bind a service account with `Monitoring Viewer` and `Compute Viewer`, and stand up the Pub/Sub → Dataflow log push if you want GCP logs in the same pane; map GCP `labels` to Datadog tags in the integration config.
- Kubernetes: deploy the Agent via the official Helm chart with the Cluster Agent enabled (it is what lets one deployment scale to thousands of pods without each node Agent hammering the API server), turn on APM and the admission controller so tracing libraries inject automatically, and enable Logging-without-Limits so you ingest all container logs but index only the ones worth querying.
Secrets and credentials: do not paste keys into Helm values. The cloud integration credentials, the Datadog API and app keys used by Terraform, and any third-party tokens live in HashiCorp Vault, leased dynamically and pulled by the Agent’s secret backend or injected into CI, so nothing sensitive sits in a values file, a pipeline variable, or — the lesson nobody wants to relearn — a git history. Login to Datadog itself federates through Okta over SAML with SCIM provisioning, so when someone joins or leaves the SRE team, their Datadog access and team membership change with their Okta record, not by a manual admin edit.
Synthetics: watch from outside, like a customer does. Internal metrics can be green while the customer-facing tracking page is down behind a broken CDN or DNS record. Configure Datadog Synthetics multistep API and browser tests against the public tracking endpoint from multiple managed locations worldwide, and private locations for internal apps, so you detect a customer-visible outage even when every backend dashboard looks healthy. Pair Synthetics with the Akamai layer: when an edge or origin-shield problem shows up, the Synthetic failure and the Akamai-side signal correlate on the same timeline.
Enterprise considerations
Security and access. A monitoring plane that can read every cloud is a high-value target, so treat it as one. Datadog access is SSO-only via Okta/Entra ID with SCIM-synced teams and least-privilege roles (read-only for most, monitor-edit for SRE, admin for a named few). Every cloud integration uses a read-only, scoped role or service principal — Monitoring Reader, not Contributor — and those credentials live in Vault. Fold security signal into the same pane rather than running it in a silo: Wiz posture findings (a public S3 bucket, an over-permissioned SP, a drifted ACL) flow to Datadog as events so an SRE triaging an incident sees the relevant misconfiguration on the same timeline, and CrowdStrike Falcon runtime detections on the Kubernetes nodes correlate against the workload metrics — a CPU spike that coincides with a Falcon detection is a very different page than a CPU spike alone. The goal is one timeline where reliability and security signals sit side by side, because at 3 a.m. an engineer should not be cross-referencing four tools.
Cost optimization. Observability spend is usage-based and grows with the estate, so engineer for it from day one or it surprises you on the next invoice.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Namespace allow-lists | Pull only CloudWatch/Monitor namespaces you use | Cuts integration metric + API cost sharply |
| Logging-without-Limits | Ingest all logs, index only what you query | 60–80% of logs ingested but never indexed |
| Custom-metric hygiene | Control high-cardinality tags; drop unused metrics | Custom metrics are the top runaway cost line |
| Trace sampling | Tail-based sampling keeps errors + slow traces, drops the rest | Retains signal at a fraction of span volume |
| Tiered retention | Short retention for noisy logs, long for audit-relevant | Pays for retention only where it earns its keep |
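The trace-sampling lever is easiest to reason about as a decision made once a trace has finished. The sketch below illustrates that idea in Python under stated assumptions: the 800 ms slow threshold is borrowed from the tracking API objective, the 5% baseline rate is a placeholder, and `keep_trace` is a hypothetical helper, not the configuration of any real sampler.

```python
"""Sketch: a tail-based sampling decision made after a trace completes."""
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_ms: float = 800.0, baseline_rate: float = 0.05) -> bool:
    """Keep every error and slow trace; keep a deterministic sample of the rest."""
    if has_error or duration_ms >= slow_ms:
        return True
    # Hash the trace id into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < baseline_rate
```

Hashing the trace id rather than drawing a random number keeps the decision repeatable, so two collectors that see the same trace agree on whether to keep it.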
The discipline that matters most: custom metrics and high-cardinality tags are the line item that quietly explodes. A single metric tagged with user_id or shipment_id can mint millions of time series. Govern tag cardinality in the same IaC review that governs everything else, and watch your own Datadog usage in Datadog — the platform meters itself.
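A quick back-of-the-envelope calculation shows why one unbounded tag matters so much. This Python sketch computes a worst-case ceiling (every combination of tag values) and the count of combinations actually observed. The tag counts are illustrative assumptions, not measurements from this estate, and the function names are hypothetical.

```python
"""Sketch: how tag cardinality multiplies into time series for one metric."""
from math import prod


def max_series(distinct_values_per_tag: dict[str, int]) -> int:
    """Worst case: one series for every combination of tag values."""
    return prod(distinct_values_per_tag.values())


def observed_series(points: list[dict[str, str]]) -> int:
    """Series that actually exist: distinct tag combinations seen in the data."""
    return len({tuple(sorted(point.items())) for point in points})


bounded = max_series({"env": 3, "service": 40, "cloud": 3, "region": 12})
with_shipment = max_series(
    {"env": 3, "service": 40, "cloud": 3, "region": 12, "shipment_id": 1_000_000}
)
```

With the assumed counts, the bounded tag set tops out at a few thousand series, while adding a million distinct shipment IDs raises the ceiling a millionfold. The real number is the observed count, but even that grows by at least one series per shipment.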
Scalability. Each layer scales independently. Agentless integrations scale with API quotas — mind CloudWatch GetMetricData limits and request quota increases before a big onboarding. The Cluster Agent is the key to Kubernetes scale: it centralizes cluster-level queries so node Agents do not each pummel the API server, letting one deployment cover thousands of pods. Ingestion scales on Datadog’s side; your job is to govern what you send (sampling, allow-lists, index filters) rather than to scale a backend you do not run — which is precisely the operational burden the self-hosted alternative would have handed you.
Failure modes, and what each one looks like. Name them before they page you.
- Integration credential expiry or revocation — an Entra app secret rotates, the Azure integration silently stops pulling, and a whole cloud goes dark on the dashboard with no error. Mitigation: monitor the integration’s own health metric, lease credentials from Vault with managed rotation, and alert on “no data” for each cloud.
- Agent version skew across clusters — different Agent versions across EKS/AKS/GKE produce subtly different tags or missing features, fracturing correlation. Mitigation: pin and roll Agent versions through Argo CD or the Helm pipeline so all clusters move together.
- The monitoring monoculture — Datadog itself has an incident and you are blind during the exact window you most need to see. Mitigation: keep Synthetics from independent locations and a thin secondary signal (cloud-native alarms on the two or three most critical SLIs) as a backstop, and define an explicit “observability is down” runbook.
- Alert storms — a shared dependency fails and 200 monitors fire at once, burying the one that matters. Mitigation: composite monitors, dependency-aware grouping, and PagerDuty event rules that suppress downstream noise.
- Tag drift: a new service ships with `service:RoutingSvc` instead of `routing-svc` and falls out of every dashboard and SLO. Mitigation: enforce Unified Service Tagging in Terraform and fail CI on missing required tags.
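The tag-drift mitigation can be pictured as a small gate in the CI pipeline. The following Python sketch checks a resource’s tags against the six required keys from the taxonomy and against a naming pattern. The lowercase kebab-case pattern and the helper name `tag_violations` are assumptions for illustration; a real pipeline would feed it tags parsed from the Terraform plan.

```python
"""Sketch: a CI gate that reports missing or malformed tags."""
import re

REQUIRED_TAGS = ("env", "service", "cloud", "region", "team", "business_unit")
# Assumed convention: lowercase kebab-case, e.g. "routing-svc".
SERVICE_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the resource passes."""
    problems = [
        f"missing required tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)
    ]
    service = tags.get("service", "")
    if service and not SERVICE_PATTERN.match(service):
        problems.append(f"service '{service}' is not lowercase kebab-case")
    return problems
```

A CI step would exit non-zero whenever the returned list is non-empty, so a service named RoutingSvc is stopped at review time instead of disappearing from dashboards later.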
Observability of the observability — alert routing that respects humans. Map monitor severity to action deliberately: a SEV-1 SLO burn pages on-call through PagerDuty with full escalation and auto-opens a ServiceNow incident for the auditable trail and post-incident review; a SEV-3 posts to a Slack channel and opens a low-priority ServiceNow ticket; an informational anomaly from Watchdog (Datadog’s automated anomaly detection) just annotates the dashboard. Bidirectional sync matters — acknowledging in PagerDuty should reflect in Datadog, and resolving the Datadog monitor should close the loop in ServiceNow — so the three tools tell one story rather than three. Tie deployments in too: emit a deploy-tracking event from GitHub Actions (or Jenkins) on every release so a latency regression on the dashboard lines up against the exact deploy that caused it, turning “when did this start” from an investigation into a glance.
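The severity-to-action mapping described here is simple enough to write down as a table in code. This Python sketch is one way to express it; the action names are hypothetical labels rather than PagerDuty, ServiceNow, or Slack API calls, and raising on an unknown severity is a design assumption so that a mislabeled monitor fails loudly instead of silently going unrouted.

```python
"""Sketch: map monitor severity to notification actions."""

ROUTES = {
    "SEV-1": ["page_pagerduty", "open_servicenow_incident"],
    "SEV-3": ["post_slack", "open_servicenow_low_priority_ticket"],
    "INFO": ["annotate_dashboard"],
}


def actions_for(severity: str) -> list[str]:
    """Return the actions for a severity, rejecting labels the table does not know."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"unrouted severity: {severity}") from None
```

Keeping the table in version control alongside the monitors means a change to who gets paged goes through the same review as a change to a threshold.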
Governance. Manage monitors, SLOs, and dashboards as code through the Datadog Terraform provider, reviewed in pull requests like any other change, so an alert threshold cannot be quietly weakened without a reviewer seeing it. Standardize dashboards with template variables on env, cloud, and service so one dashboard serves every team rather than each team forking their own. Configure cloud resources and Agent rollouts with Ansible where Terraform is not the right fit (host-level Agent config, appliance onboarding). And keep the tag taxonomy itself versioned and reviewed — it is the schema the entire single pane depends on, and it deserves the same rigor as a database migration.
Explicit tradeoffs
Accept these or do not build it. A single pane on Datadog means a meaningful, usage-based vendor bill and a real degree of vendor lock-in — your dashboards, monitors, and SLO definitions are expressed in Datadog’s model, and migrating off is non-trivial. You inherit the granularity and cost characteristics of each cloud’s metric API for the agentless portion. You take on a genuine cost-governance discipline — custom metrics, log indexing, and trace volume will grow with success and must be actively managed, not set and forgotten. And you accept a monitoring monoculture risk that has to be explicitly mitigated with independent Synthetics and a thin native-alarm backstop. The reconciliation work — forcing three clouds’ tagging into one taxonomy — is real engineering, not a toggle, and it never fully ends because new services keep arriving.
The alternatives, and when they win. If your estate is genuinely single-cloud, the native tool (CloudWatch, Azure Monitor, Cloud Monitoring) is cheaper, deeply integrated, and the cross-cloud unification Datadog sells you is value you would not use. If you have the platform-engineering headcount and a strong cost incentive, a self-hosted Prometheus + Grafana + Loki + Tempo + Mimir stack gives you control and a lower license bill in exchange for the operational burden of running it — a fair trade for a company whose core competency is infrastructure. If you only need black-box uptime and a status page, a lightweight synthetic-monitoring-only service is far simpler than a full observability platform. And Dynatrace is the closest peer to Datadog in this multi-cloud single-pane role — its automatic dependency mapping and AI root-cause analysis are genuinely strong — so the choice between them often comes down to existing relationships, APM depth needs, and pricing rather than capability gaps. Datadog earns the slot here because the breadth of first-party integrations, the SLO and incident tooling, and the synthetic-plus-APM-plus-logs unification line up exactly with a three-cloud company that needs one screen and does not want to run a database to get it.
The shape of the win
For the logistics company’s SRE org, the payoff is not “nicer dashboards.” It is that the next time a customs-clearance slowdown on Azure starts to back up routing on AWS, one engineer looking at one dashboard sees both clouds’ latency climbing on the same timeline, the burn-rate monitor on the cross-cloud tracking SLO pages the right on-call through PagerDuty inside two minutes instead of fifty-one, the deploy-tracking event rules out a release as the cause in one glance, and the ServiceNow incident writes itself for the post-mortem — all because every cloud speaks the same tag language into the same plane. The 51-minute correlation gap that started this project closes to single-digit minutes, and the VP gets the one thing actually asked for: a single screen that answers, honestly and across every cloud, “are we keeping our promise to customers.” Everything upstream — the agentless integrations, the Agents on Kubernetes, the OpenTelemetry catch-all, the Okta SSO, the Vault-held credentials, the Wiz and Falcon signals folded into the same timeline — exists to make that one screen true.