Datadog as the Single Pane of Glass for Multi-Cloud Operations

A global logistics company — think parcels, cross-border freight, and a customer-facing tracking page that 30 million shippers refresh during the holiday peak — has grown by acquisition into a three-cloud mess. The core shipment-routing platform runs on AWS. A regional acquisition brought a billing and customs stack on Azure. The data-science team that predicts delivery ETAs lives on GCP because that is where their BigQuery warehouse already was. Each cloud has its own native monitoring — CloudWatch, Azure Monitor, Cloud Monitoring — and each team built dashboards in the tool nearest to hand. The result is the failure mode every VP of Engineering eventually hits: during the last peak, a customs-clearance slowdown on Azure cascaded into stalled routing on AWS, and it took 51 minutes to correlate the two because nobody could see both clouds on one screen. The mandate from the new SRE leadership is blunt: one place to see everything, one place to be paged, and an honest number for “are we meeting our promise to customers.” This article is the reference architecture for making Datadog that single pane of glass across AWS, Azure, GCP, and Kubernetes — and, just as importantly, for wiring it into the identity, security, and incident-management fabric the rest of the company already runs.

The pressures here are operational rather than regulatory, but they are no less unforgiving. Heterogeneity means three different metric models, three log formats, and three tagging conventions that must be reconciled into one vocabulary or the dashboards are worthless. Scale means tens of thousands of containers across dozens of Kubernetes clusters emitting metrics every fifteen seconds, plus high-cardinality logs from a tracking API that spikes 8× at peak. Cost means observability spend that, left ungoverned, can quietly rival the compute bill it is meant to watch. And time-to-detect means the difference between a customer noticing a stuck parcel and an on-call engineer noticing first. A single-pane strategy satisfies all four only if it is built deliberately — not by clicking “enable integration” on three clouds and hoping the dashboards assemble themselves.

Why not the obvious shortcuts

Three tempting non-answers will be proposed in the first planning meeting, and each fails in a way worth naming.

“Just use each cloud’s native monitoring and put links in a wiki.” This is the status quo that produced the 51-minute incident. Native tools are excellent inside their own cloud and blind outside it; there is no shared trace that follows a request from the AWS routing service into the Azure billing call, and no single alert that fires on the combination. Correlation becomes a human tab-switching exercise during the exact moment humans are worst at it.

“Aggregate everything into a self-hosted stack — Prometheus plus Grafana plus Loki plus Tempo — and own it.” This is a legitimate architecture and cheaper in license terms, but it trades a license bill for an operations bill: you now run, scale, and shard a metrics store, a log store, and a trace store, federate Prometheus across clouds, and staff the team that keeps that pipeline alive at 3 a.m. For a company whose core competency is moving parcels, not running a time-series database, that is the wrong place to spend headcount.

“Pick one cloud’s tool and force the other two into it.” Azure Monitor will technically ingest AWS metrics through connectors, but it treats them as second-class, the tagging never lines up, and you have hitched your cross-cloud observability to a vendor with a structural incentive to make the other clouds look harder to operate. A neutral plane that treats all three as first-class is the point.

Datadog threads the needle by being cloud-agnostic by design: it pulls metrics and logs from each cloud’s own APIs through first-party integrations, accepts open-standard telemetry (OpenTelemetry, StatsD, Prometheus scrape) from anything those integrations miss, and unifies it all under one tag model, one query language, one alerting engine, and one incident workflow. The clouds stay where they are; Datadog supplies the shared lens.

Architecture overview

[Figure: architecture diagram for Datadog as the single pane of glass for multi-cloud operations]

The platform has three logical layers that are useful to hold separately: a collection layer that gets telemetry out of each cloud, a Datadog platform layer that stores, correlates, and visualizes it, and an action layer that turns signal into a human being doing something. Telemetry flows up; alerts and identity flow down.

The defining design choice is the one that makes the whole thing coherent: a single, enforced tag taxonomy applied at the source across all three clouds. Every metric, log, and span carries env, service, cloud, region, team, and business_unit — the same keys, the same values, everywhere. Without this, a “single pane” is three panes sharing a URL. This taxonomy is not a Datadog feature you toggle; it is a contract enforced in your infrastructure-as-code, which is why Terraform appears so early below.

Collection layer, following the data flow per cloud:

AWS. The AWS integration assumes a cross-account IAM role and pulls CloudWatch metrics, plus inventory and tags, for hundreds of services without an agent. For richer signal, CloudWatch logs and events stream through a Datadog Forwarder Lambda (or Kinesis Firehose for volume), and EC2/ECS hosts run the Datadog Agent for process-level metrics and live tracing.
Azure. The Azure integration registers an Entra ID app (service principal) with Monitoring Reader across the relevant subscriptions and pulls Azure Monitor metrics and Activity Logs. AKS nodes and VMs run the Agent; Azure platform logs route via Event Hub to Datadog.
GCP. The GCP integration uses a service account with Monitoring Viewer/Compute Viewer to pull Cloud Monitoring metrics; logs ship through a Pub/Sub → Dataflow push to Datadog. GKE nodes run the Agent.
Kubernetes (all clouds). Each cluster runs the Datadog Agent as a DaemonSet plus the Cluster Agent, deployed by Helm. The Agents auto-discover pods, scrape Prometheus-annotated endpoints, tail container logs, and collect spans from the APM tracing libraries injected into application pods, so a distributed trace follows a tracking request from the GKE front end through the EKS routing service into the AKS billing call as one waterfall.
Anything else — legacy virtual appliances, on-prem load balancers, network gear — emits via OpenTelemetry to the Datadog OTLP endpoint or the OpenTelemetry Collector with the Datadog exporter, so nothing is structurally excluded from the pane.
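For custom application or appliance metrics that arrive over StatsD rather than through a cloud integration, the emitter has to stamp the taxonomy itself. The Python sketch below formats a DogStatsD datagram that carries the six taxonomy keys and sends it over UDP to a local Agent. The metric name, the tag values, and the localhost address with port 8125 are illustrative assumptions, not details of the environment described here.

```python
import socket

# Illustrative tag values; the six keys are the taxonomy named in this article.
EXAMPLE_TAGS = {
    "env": "prod",
    "service": "edge-lb",
    "cloud": "onprem",
    "region": "eu-west",
    "team": "network",
    "business_unit": "tracking",
}


def format_datagram(name, value, metric_type, tags):
    """Build one DogStatsD datagram: name:value|type|#key:value,key:value."""
    tag_part = ",".join(f"{key}:{tags[key]}" for key in sorted(tags))
    return f"{name}:{value}|{metric_type}|#{tag_part}"


def send(datagram, host="127.0.0.1", port=8125):
    """Send the datagram over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# "g" marks a gauge; call send(gauge) on a host that runs the Agent.
gauge = format_datagram("appliance.queue_depth", 42, "g", EXAMPLE_TAGS)
```

Because the tags travel inside the datagram, a metric emitted this way groups with integration and Agent data on the same dashboards without any cloud-specific handling.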

Platform layer. Datadog ingests all of the above, indexes metrics and (selectively) logs, stitches traces, and — critically — applies tag-based correlation so a single dashboard widget can group “p99 latency by service across all clouds” without caring where the data physically originated. Service Level Objectives are defined on top of these signals, monitors evaluate continuously, and Synthetics run scripted checks against customer-facing endpoints from outside every cloud.

Action layer. When a monitor breaches, Datadog routes through PagerDuty for human paging (on-call schedules, escalation policies) and opens or updates a record in ServiceNow for the auditable incident trail and change correlation. Identity into Datadog itself is brokered by Okta (or Entra ID) via SAML/SCIM, so access and team mapping match the rest of the org.

Component breakdown

Component	Service / tool	Role in the platform	Key configuration choices
AWS collection	Datadog AWS integration + Forwarder Lambda + Agent	Pull CloudWatch metrics/inventory; stream logs/events; host & APM metrics	Cross-account IAM role; namespace allow-list to control cost; ECS/EKS Agent
Azure collection	Datadog Azure integration + Entra app + Agent	Pull Azure Monitor metrics + Activity Log; AKS/VM host metrics	`Monitoring Reader` SP; Event Hub log route; per-subscription scoping
GCP collection	Datadog GCP integration + Pub/Sub log push + Agent	Pull Cloud Monitoring metrics; ship logs; GKE host metrics	`Monitoring Viewer` SA; Dataflow log sink; label-to-tag mapping
Kubernetes	Datadog Agent DaemonSet + Cluster Agent (Helm)	Auto-discovery, log tailing, Prometheus scrape, APM tracing	Cluster Agent for scale; admission controller for trace lib injection
Open telemetry	OpenTelemetry Collector / OTLP, StatsD	Catch-all for appliances, legacy, custom app metrics	Datadog exporter; unified tag processor; tail-based sampling
Storage & correlation	Datadog platform (Metrics, Logs, APM)	Unified store, one query language, tag-based correlation	Logging-without-Limits (ingest vs. index split); retention tiers
Dashboards	Datadog Dashboards + Watchdog	Single-pane views; automated anomaly surfacing	Template variables on `env`/`cloud`/`service`; SLO widgets
Reliability targets	Datadog SLOs + Monitors	Honest customer-promise metric; burn-rate alerting	Metric- and monitor-based SLOs; multi-window burn-rate monitors
Black-box checks	Datadog Synthetics	External validation of customer endpoints from outside every cloud	Multistep API + browser tests; private locations for internal apps
Human paging	PagerDuty	On-call schedules, escalation, acknowledgement	Severity-mapped routing; event rules; bidirectional ack sync
ITSM record	ServiceNow	Auditable incident/change trail, post-incident review	Auto-create Incident on SEV; link to change records
Identity / SSO	Okta + Entra ID	SSO and team provisioning into Datadog	SAML login; SCIM team sync; group-to-role mapping
Secrets	HashiCorp Vault	Hold cloud integration creds, API/app keys for IaC	Dynamic leases; Agent secret backend; no keys in Helm values
Posture & runtime	Wiz + CrowdStrike Falcon	Cloud posture findings and runtime detections into Datadog	Wiz → Datadog event stream; Falcon detections as correlated signal
CI / IaC	GitHub Actions / Jenkins + Terraform / Ansible	Manage integrations, monitors, SLOs, dashboards as code	Datadog provider; CI deploy-tracking events; monitor-as-code review

A handful of these choices carry the weight of the design, and they are the ones teams get wrong.

Why agentless integrations and the Agent, not one or the other. The cloud integrations are agentless and effortless (assume a role, get hundreds of services’ metrics), but they read from CloudWatch/Azure Monitor/Cloud Monitoring, which means you inherit those platforms’ 1–5 minute granularity and, on AWS, you pay for the CloudWatch API calls the integration makes. The Datadog Agent running on hosts gives 15-second granularity, process-level detail, live process and network maps, and APM tracing that the cloud APIs cannot provide. The right answer is both: integrations for breadth across every managed service, Agents for depth on the workloads you actually operate. Relying on Agents alone blinds you to managed services; relying on integrations alone leaves you with coarse data and a growing CloudWatch API bill.

Why the tag taxonomy is enforced in IaC, not requested in a wiki. Each cloud names things differently — AWS tags, Azure tags, GCP labels — and humans tag inconsistently under deadline. If service is routing-svc on AWS and RoutingService on Azure, no dashboard groups them. The fix is to apply Unified Service Tagging (env, service, version) plus your business tags at provisioning time in Terraform, and to use Datadog’s tag-remapping in each integration to normalize the inevitable drift. The reconciliation table below is the kind of mapping you maintain deliberately:

Concept	AWS	Azure	GCP	Normalized Datadog tag
Environment	`Environment` tag	`env` tag	`env` label	`env`
Service	`Service` tag	`app` tag	`service` label	`service`
Region	AWS region	Azure location	GCP region	`region`
Cost owner	`CostCenter` tag	`costCenter` tag	`cost-center` label	`business_unit`
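The mapping in the table can be read as a small lookup, sketched here in Python. This only illustrates the reconciliation logic and is not how Datadog's integration remapping is configured. The `SERVICE_ALIASES` table and the sample values are hypothetical; lowercasing alone cannot turn RoutingService into routing-svc, so known variants are listed explicitly.

```python
# Cloud-native key -> normalized Datadog tag, per cloud (from the table above).
KEY_MAP = {
    "aws": {"Environment": "env", "Service": "service", "CostCenter": "business_unit"},
    "azure": {"env": "env", "app": "service", "costCenter": "business_unit"},
    "gcp": {"env": "env", "service": "service", "cost-center": "business_unit"},
}

# Hypothetical alias table for service names that drifted before enforcement.
SERVICE_ALIASES = {"routingservice": "routing-svc"}


def normalize(cloud, native_tags, region):
    """Translate one resource's native tags or labels into the shared taxonomy."""
    tags = {"cloud": cloud, "region": region.lower()}
    for native_key, value in native_tags.items():
        target = KEY_MAP[cloud].get(native_key)
        if target is None:
            continue  # keys outside the taxonomy are ignored in this sketch
        cleaned = value.strip().lower()
        if target == "service":
            cleaned = SERVICE_ALIASES.get(cleaned, cleaned)
        tags[target] = cleaned
    return tags


aws_tags = normalize("aws", {"Environment": "Prod", "Service": "routing-svc"}, "us-east-1")
azure_tags = normalize("azure", {"env": "prod", "app": "RoutingService"}, "WestEurope")
```

Both calls end up with the same `env` and `service` values, which is the property a cross-cloud dashboard needs in order to group them.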

Why SLOs, not just CPU dashboards. The leadership question is “are we meeting our promise to customers,” and CPU graphs do not answer it. Define the Service Level Objective on the indicator the customer feels — for the tracking API, “99.5% of requests succeed and return in under 800 ms over 30 days” — and let Datadog compute the error budget and burn rate. This is what makes a single pane useful rather than merely unified: a row of SLO status widgets tells the VP at a glance whether the company is keeping its word, across all three clouds, on one screen.
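The arithmetic behind an error budget and a burn rate is short enough to sketch in Python. A "bad" request here is one that fails or takes longer than 800 ms, matching the indicator above. The request counts are made-up inputs, and the function names are illustrative; Datadog computes these values itself once the SLO is defined.

```python
def budget_ratio(target_pct):
    """Fraction of requests allowed to be bad, e.g. 99.5 gives 0.005."""
    return (100 - target_pct) / 100


def allowed_failures(target_pct, total_requests):
    """Number of bad requests the budget tolerates over the SLO window."""
    return total_requests * budget_ratio(target_pct)


def burn_rate(bad, total, target_pct):
    """Observed bad ratio divided by the budgeted bad ratio.

    A value of 1 spends the budget exactly over the full window;
    higher values spend it proportionally faster.
    """
    if total == 0:
        return 0.0
    return (bad / total) / budget_ratio(target_pct)


def hours_until_exhausted(rate, window_days=30):
    """Hours a full budget lasts if the given burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# Made-up sample: 200 bad requests out of 10,000 in the last hour.
rate = burn_rate(200, 10_000, 99.5)
```

With these inputs the burn rate is about 4, so a full 30-day budget would last roughly 180 hours, or 7.5 days, if that error ratio persisted.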

Implementation guidance

Provision the integrations and the monitoring estate with Terraform — treat them as code, not console clicks. Observability config that lives only in a UI rots, drifts, and cannot be reviewed. The Datadog Terraform provider manages integrations, monitors, SLOs, and dashboards alongside the infrastructure they watch.

A minimal shape for the AWS integration and a multi-cloud SLO communicates the intent:

resource "datadog_integration_aws_account" "logistics_prod" {
  account_id = "210987654321"
  auth_config {
    aws_auth_config_role {
      role_name = "DatadogIntegrationRole"   # cross-account, read-only
    }
  }
  metrics_config {
    namespace_filters {
      include_only = ["AWS/ECS", "AWS/ApplicationELB", "AWS/RDS", "AWS/Lambda"]
    }                                          # allow-list to control CloudWatch cost
  }
  resources_config { cloud_security_posture_management_collection = false }
}

resource "datadog_service_level_objective" "tracking_api" {
  name        = "Tracking API availability"
  type        = "metric"
  description = "99.5% success over 30d across all clouds"
  query {
    numerator   = "sum:trace.http.request.hits{service:tracking-api,!http.status_class:5xx}.as_count()"
    denominator = "sum:trace.http.request.hits{service:tracking-api}.as_count()"
  }
  thresholds {
    timeframe = "30d"
    target    = 99.5
    warning   = 99.7   # warning must sit above the target
  }
  tags = ["business_unit:tracking", "env:prod"]
}

Note that the SLO query does not mention a cloud at all. It filters on service, and because every cloud stamps service:tracking-api identically, the objective spans AWS, Azure, and GCP transparently. That is the single pane paying off in one resource block.
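To make the cross-cloud behavior concrete, the Python sketch below computes the same ratio as the SLO query from hypothetical per-cloud counts. The numbers are invented for illustration, and treating an empty window as 100% is an assumption of this sketch.

```python
# Hypothetical counts for one 30-day window: cloud -> (total hits, 5xx hits).
HITS_BY_CLOUD = {
    "aws": (600_000, 1_200),
    "azure": (250_000, 2_500),
    "gcp": (150_000, 300),
}


def sli_percent(hits_by_cloud):
    """Share of non-5xx requests, summed across whichever clouds are passed in."""
    total = sum(hits for hits, _ in hits_by_cloud.values())
    errors = sum(errs for _, errs in hits_by_cloud.values())
    if total == 0:
        return 100.0  # assumption: no traffic means nothing was violated
    return 100 * (total - errors) / total


combined = sli_percent(HITS_BY_CLOUD)
azure_only = sli_percent({"azure": HITS_BY_CLOUD["azure"]})
```

In this made-up data the combined figure clears 99.5% while Azure on its own does not, which is a reason to keep a per-cloud breakdown on the dashboard next to the combined SLO widget.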

The collection pipeline, cloud by cloud, with the gotchas.

AWS: create the cross-account DatadogIntegrationRole with a read-only policy, and immediately set a namespace_filters allow-list. The single most common cost surprise on this architecture is leaving every CloudWatch namespace enabled and paying for API calls on services you never look at. Stream high-value logs through the Forwarder Lambda; do not index everything.
Azure: register the Entra ID app, grant Monitoring Reader at the management-group level so new subscriptions are covered automatically, and route diagnostic logs through Event Hub. Scope tightly — an over-broad SP is a finding Wiz will (correctly) raise.
GCP: bind a service account with Monitoring Viewer and Compute Viewer, and stand up the Pub/Sub → Dataflow log push if you want GCP logs in the same pane; map GCP labels to Datadog tags in the integration config.
Kubernetes: deploy the Agent via the official Helm chart with the Cluster Agent enabled (it is what lets one deployment scale to thousands of pods without each node Agent hammering the API server), turn on APM and the admission controller so tracing libraries inject automatically, and apply Logging-without-Limits exclusion filters on the Datadog side so you ingest all container logs but index only the ones worth querying.

Secrets and credentials: do not paste keys into Helm values. The cloud integration credentials, the Datadog API and app keys used by Terraform, and any third-party tokens live in HashiCorp Vault, leased dynamically and pulled by the Agent’s secret backend or injected into CI, so nothing sensitive sits in a values file, a pipeline variable, or — the lesson nobody wants to relearn — a git history. Login to Datadog itself federates through Okta over SAML with SCIM provisioning, so when someone joins or leaves the SRE team, their Datadog access and team membership change with their Okta record, not by a manual admin edit.
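The Agent's secret backend is an executable that the Agent runs: it receives a JSON request listing secret handles on standard input and prints a JSON object with a value or an error per handle. The Python sketch below shows that exchange with an in-memory dictionary standing in for Vault. The handle name, the stored value, and the `FAKE_VAULT` store are hypothetical; a real backend would call Vault where the dictionary lookup happens.

```python
import json
import sys

# Stand-in for Vault, for illustration only. Handle and value are hypothetical.
FAKE_VAULT = {"datadog/api_key": "example-api-key"}


def resolve(request, store):
    """Answer each requested handle with the value/error pair the Agent reads."""
    response = {}
    for handle in request.get("secrets", []):
        if handle in store:
            response[handle] = {"value": store[handle], "error": None}
        else:
            response[handle] = {
                "value": None,
                "error": f"unknown secret handle: {handle}",
            }
    return response


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for the executable: read the request, write the response."""
    request = json.load(stdin)
    json.dump(resolve(request, FAKE_VAULT), stdout)


sample_request = {"version": "1.0", "secrets": ["datadog/api_key", "missing/handle"]}
sample_response = resolve(sample_request, FAKE_VAULT)
```

A deployed backend would call `main()` when executed, and the Agent configuration would reference the handle in place of the literal key, so the values file never contains the secret.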

Synthetics: watch from outside, like a customer does. Internal metrics can be green while the customer-facing tracking page is down behind a broken CDN or DNS record. Configure Datadog Synthetics multistep API and browser tests against the public tracking endpoint from multiple managed locations worldwide, and private locations for internal apps, so you detect a customer-visible outage even when every backend dashboard looks healthy. Pair Synthetics with signal from the CDN that fronts the tracking page (Akamai, in this estate): when an edge or origin-shield problem shows up, the Synthetic failure and the Akamai-side signal correlate on the same timeline.

Enterprise considerations

Security and access. A monitoring plane that can read every cloud is a high-value target, so treat it as one. Datadog access is SSO-only via Okta/Entra ID with SCIM-synced teams and least-privilege roles (read-only for most, monitor-edit for SRE, admin for a named few). Every cloud integration uses a read-only, scoped role or service principal — Monitoring Reader, not Contributor — and those credentials live in Vault. Fold security signal into the same pane rather than running it in a silo: Wiz posture findings (a public S3 bucket, an over-permissioned SP, a drifted ACL) flow to Datadog as events so an SRE triaging an incident sees the relevant misconfiguration on the same timeline, and CrowdStrike Falcon runtime detections on the Kubernetes nodes correlate against the workload metrics — a CPU spike that coincides with a Falcon detection is a very different page than a CPU spike alone. The goal is one timeline where reliability and security signals sit side by side, because at 3 a.m. an engineer should not be cross-referencing four tools.

Cost optimization. Observability spend is usage-based and grows with the estate, so engineer for it from day one or it surprises you on the next invoice.

Lever	Mechanism	Typical effect
Namespace allow-lists	Pull only CloudWatch/Monitor namespaces you use	Cuts integration metric + API cost sharply
Logging-without-Limits	Ingest all logs, index only what you query	60–80% of logs ingested but never indexed
Custom-metric hygiene	Control high-cardinality tags; drop unused metrics	Custom metrics are the top runaway cost line
Trace sampling	Tail-based sampling keeps errors + slow traces, drops the rest	Retains signal at a fraction of span volume
Tiered retention	Short retention for noisy logs, long for audit-relevant	Pays for retention only where it earns its keep
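The trace-sampling lever in the table works by deciding after a trace is complete, which is what lets it keep every error and every slow request. The Python sketch below shows that decision only. The span field names and the 5% baseline rate are assumptions, the 800 ms threshold is borrowed from the tracking API's latency target, and in practice this logic runs in the collector or the platform, not in application code.

```python
import random


def keep_trace(spans, slow_ms=800, baseline_rate=0.05, rng=random.random):
    """Tail-based decision: keep errors and slow traces, sample the healthy rest."""
    if not spans:
        return False
    has_error = any(span.get("error") for span in spans)
    # Trace duration runs from the earliest span start to the latest span end.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or duration_ms >= slow_ms:
        return True
    return rng() < baseline_rate


fast_ok = [{"start_ms": 0, "end_ms": 120}, {"start_ms": 10, "end_ms": 90}]
slow_ok = [{"start_ms": 0, "end_ms": 950}]
fast_error = [{"start_ms": 0, "end_ms": 40, "error": True}]
```

Passing `rng` as a parameter keeps the sampling deterministic in tests, while production code would use the default random source.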

The discipline that matters most: custom metrics and high-cardinality tags are the line item that quietly explodes. A single metric tagged with user_id or shipment_id can mint millions of time series. Govern tag cardinality in the same IaC review that governs everything else, and watch your own Datadog usage in Datadog — the platform meters itself.
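The growth is multiplicative, which a few lines of Python make visible. The sketch computes a worst-case upper bound (every combination of tag values occurring) from assumed distinct-value counts per tag; the counts and the review limit are illustrative.

```python
from math import prod


def max_series(tag_cardinalities):
    """Upper bound on time series for one metric: product of distinct tag values."""
    return prod(tag_cardinalities.values())


def risky_tags(tag_cardinalities, limit=1_000):
    """Tags whose distinct-value count alone exceeds the chosen review limit."""
    return sorted(tag for tag, count in tag_cardinalities.items() if count > limit)


# Assumed counts: 3 environments, 3 clouds, 12 regions.
bounded = {"env": 3, "cloud": 3, "region": 12}
# The same metric with a per-shipment tag added (2 million shipments assumed).
unbounded = {**bounded, "shipment_id": 2_000_000}

before = max_series(bounded)
after = max_series(unbounded)
```

Under these assumptions one extra tag moves the bound from 108 series to 216 million, which is why identifiers such as shipment_id belong in logs or traces and not in metric tags.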

Scalability. Each layer scales independently. Agentless integrations scale with API quotas — mind CloudWatch GetMetricData limits and request quota increases before a big onboarding. The Cluster Agent is the key to Kubernetes scale: it centralizes cluster-level queries so node Agents do not each pummel the API server, letting one deployment cover thousands of pods. Ingestion scales on Datadog’s side; your job is to govern what you send (sampling, allow-lists, index filters) rather than to scale a backend you do not run — which is precisely the operational burden the self-hosted alternative would have handed you.

Failure modes, and what each one looks like. Name them before they page you.

Integration credential expiry or revocation — an Entra app secret rotates, the Azure integration silently stops pulling, and a whole cloud goes dark on the dashboard with no error. Mitigation: monitor the integration’s own health metric, lease credentials from Vault with managed rotation, and alert on “no data” for each cloud.
Agent version skew across clusters — different Agent versions across EKS/AKS/GKE produce subtly different tags or missing features, fracturing correlation. Mitigation: pin and roll Agent versions through Argo CD or the Helm pipeline so all clusters move together.
The monitoring monoculture: Datadog itself has an incident and you are blind during the exact window you most need to see. Mitigation: keep Synthetics running from independent locations, add at least one external probe that does not depend on Datadog to run or to alert, keep a thin secondary signal (cloud-native alarms on the two or three most critical SLIs) as a backstop, and define an explicit “observability is down” runbook.
Alert storms — a shared dependency fails and 200 monitors fire at once, burying the one that matters. Mitigation: composite monitors, dependency-aware grouping, and PagerDuty event rules that suppress downstream noise.
Tag drift — a new service ships with service:RoutingSvc instead of routing-svc and falls out of every dashboard and SLO. Mitigation: enforce Unified Service Tagging in Terraform and fail CI on missing required tags.
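The "fail CI on missing required tags" mitigation can be as small as the Python check below. It assumes the pipeline has already extracted each resource's tags into a dictionary, for example from a Terraform plan rendered as JSON. The required-key list combines Unified Service Tagging with the business tags named earlier, and the lowercase-kebab naming pattern is an assumed convention.

```python
import re

REQUIRED = ("env", "service", "version", "cloud", "region", "team", "business_unit")
# Assumed convention: lowercase words joined by single hyphens, e.g. routing-svc.
SERVICE_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def tag_violations(resource_name, tags):
    """Return human-readable problems; an empty list means the resource passes."""
    problems = []
    for key in REQUIRED:
        if not tags.get(key):
            problems.append(f"{resource_name}: missing required tag '{key}'")
    service = tags.get("service", "")
    if service and not SERVICE_PATTERN.match(service):
        problems.append(
            f"{resource_name}: service '{service}' is not lowercase-kebab"
        )
    return problems


good = {
    "env": "prod", "service": "routing-svc", "version": "1.4.2", "cloud": "aws",
    "region": "us-east-1", "team": "routing", "business_unit": "tracking",
}
drifted = {**good, "service": "RoutingSvc"}
```

A CI step would collect the violations for every resource in the plan and exit non-zero if the combined list is not empty, so a drifted name is caught in review and never reaches a dashboard.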

Observability of the observability — alert routing that respects humans. Map monitor severity to action deliberately: a SEV-1 SLO burn pages on-call through PagerDuty with full escalation and auto-opens a ServiceNow incident for the auditable trail and post-incident review; a SEV-3 posts to a Slack channel and opens a low-priority ServiceNow ticket; an informational anomaly from Watchdog (Datadog’s automated anomaly detection) just annotates the dashboard. Bidirectional sync matters — acknowledging in PagerDuty should reflect in Datadog, and resolving the Datadog monitor should close the loop in ServiceNow — so the three tools tell one story rather than three. Tie deployments in too: emit a deploy-tracking event from GitHub Actions (or Jenkins) on every release so a latency regression on the dashboard lines up against the exact deploy that caused it, turning “when did this start” from an investigation into a glance.
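The severity mapping described above reads naturally as a lookup table. The Python sketch below encodes only the three levels this article names; the action identifiers are placeholder strings, not integration names, and raising on an unmapped severity is a design assumption so that gaps in the policy surface in review instead of being silently dropped.

```python
# Severity -> ordered actions, taken from the routing policy described above.
ROUTES = {
    "SEV-1": ("pagerduty_page_with_escalation", "servicenow_incident"),
    "SEV-3": ("slack_channel_post", "servicenow_low_priority_ticket"),
    "INFO": ("dashboard_annotation",),
}


def actions_for(severity):
    """Return the actions for a severity, or fail loudly if none is defined."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"no routing defined for severity {severity!r}") from None


def pages_a_human(severity):
    """True only when the route includes a PagerDuty page."""
    return any(action.startswith("pagerduty") for action in actions_for(severity))
```

Keeping the table in version control next to the monitors means a change in who gets paged goes through the same pull-request review as a change in thresholds.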

Governance. Manage monitors, SLOs, and dashboards as code through the Datadog Terraform provider, reviewed in pull requests like any other change, so an alert threshold cannot be quietly weakened without a reviewer seeing it. Standardize dashboards with template variables on env, cloud, and service so one dashboard serves every team rather than each team forking their own. Configure cloud resources and Agent rollouts with Ansible where Terraform is not the right fit (host-level Agent config, appliance onboarding). And keep the tag taxonomy itself versioned and reviewed — it is the schema the entire single pane depends on, and it deserves the same rigor as a database migration.

Explicit tradeoffs

Accept these or do not build it. A single pane on Datadog means a meaningful, usage-based vendor bill and a real degree of vendor lock-in — your dashboards, monitors, and SLO definitions are expressed in Datadog’s model, and migrating off is non-trivial. You inherit the granularity and cost characteristics of each cloud’s metric API for the agentless portion. You take on a genuine cost-governance discipline — custom metrics, log indexing, and trace volume will grow with success and must be actively managed, not set and forgotten. And you accept a monitoring monoculture risk that has to be explicitly mitigated with independent Synthetics and a thin native-alarm backstop. The reconciliation work — forcing three clouds’ tagging into one taxonomy — is real engineering, not a toggle, and it never fully ends because new services keep arriving.

The alternatives, and when they win. If your estate is genuinely single-cloud, the native tool (CloudWatch, Azure Monitor, Cloud Monitoring) is cheaper, deeply integrated, and the cross-cloud unification Datadog sells you is value you would not use. If you have the platform-engineering headcount and a strong cost incentive, a self-hosted Prometheus + Grafana + Loki + Tempo + Mimir stack gives you control and a lower license bill in exchange for the operational burden of running it — a fair trade for a company whose core competency is infrastructure. If you only need black-box uptime and a status page, a lightweight synthetic-monitoring-only service is far simpler than a full observability platform. And Dynatrace is the closest peer to Datadog in this multi-cloud single-pane role — its automatic dependency mapping and AI root-cause analysis are genuinely strong — so the choice between them often comes down to existing relationships, APM depth needs, and pricing rather than capability gaps. Datadog earns the slot here because the breadth of first-party integrations, the SLO and incident tooling, and the synthetic-plus-APM-plus-logs unification line up exactly with a three-cloud company that needs one screen and does not want to run a database to get it.

The shape of the win

For the logistics company’s SRE org, the payoff is not “nicer dashboards.” It is that the next time a customs-clearance slowdown on Azure starts to back up routing on AWS, one engineer looking at one dashboard sees both clouds’ latency climbing on the same timeline, the burn-rate monitor on the cross-cloud tracking SLO pages the right on-call through PagerDuty inside two minutes instead of fifty-one, the deploy-tracking event rules out a release as the cause in one glance, and the ServiceNow incident writes itself for the post-mortem — all because every cloud speaks the same tag language into the same plane. The 51-minute correlation gap that started this project closes to single-digit minutes, and the VP gets the one thing actually asked for: a single screen that answers, honestly and across every cloud, “are we keeping our promise to customers.” Everything upstream — the agentless integrations, the Agents on Kubernetes, the OpenTelemetry catch-all, the Okta SSO, the Vault-held credentials, the Wiz and Falcon signals folded into the same timeline — exists to make that one screen true.

Written by Vinod
