It’s a Tuesday at a mid-sized online grocery retailer, and the “Place Order” button has started spinning forever. Not for everyone — about one shopper in twenty, seemingly at random — but it’s peak dinner-planning hour, baskets worth real money are being abandoned, and the support inbox is filling with screenshots. The on-call engineer opens the dashboard she has, sees the web tier is “green,” and is now completely stuck. CPU looks fine. The page that serves the button returns HTTP 200. And yet checkout is broken. She SSHes into a server, starts grep-ing through a log file, finds nothing useful, and forty minutes later the incident is still open because she has data but no way to ask it a question.
That gap — between collecting telemetry and being able to interrogate it — is the difference between monitoring and observability, and it’s what this article is about. Monitoring answers questions you decided to ask in advance: “is CPU above 80%?” Observability is the property of a system that lets you ask new questions about its behavior — questions you didn’t anticipate — from the data it already emits. You get there by collecting three complementary kinds of signal, the three pillars: logs, metrics, and traces. None of them is sufficient alone. Together they let our grocery engineer go from “checkout is slow for some people” to “the payment service’s connection pool to the fraud-check API is exhausted, here’s the exact request, here’s the line of code” in minutes instead of an afternoon.
This is a foundational piece. By the end you’ll know what each pillar is, what it’s good and bad at, where the cloud-native tools (CloudWatch, Azure Monitor, Cloud Monitoring) and the big platforms (Datadog, Dynatrace) sit, and how to write your first SLO and your first sane alert — the kind that wakes someone up only when a customer is actually hurting.
The three pillars, in plain terms
Start with intuition before tooling. Imagine one shopper clicking “Place Order.” That single click fans out into a dozen things happening across your systems, and each pillar captures a different view of those things.
A log is a timestamped record of a discrete event. “At 19:43:07, order 8812 failed payment authorization: connection timeout.” Logs are the detailed story: high information per line, often human-readable, and great for explaining why something specific happened. Their weakness is volume and cost: emit a line per request at scale and you drown in terabytes, and you can’t easily answer “how often does this happen” without scanning everything.
A metric is a number measured over time. “Orders failed per minute: 14.” “p95 checkout latency: 2.3 seconds.” Metrics are cheap to store because they’re pre-aggregated — a counter is just a number ticking up — so they’re perfect for dashboards, trends, and alerting. Their weakness is the mirror image of logs: a metric tells you that failures spiked, never which order or why. You lose all the detail in exchange for being able to keep years of history affordably.
A trace follows one request as it travels across services. Our shopper’s click enters the web frontend, calls the order service, which calls the payment service, which calls an external fraud API — a trace stitches all of those hops into a single timeline, showing how long each step took and where it stalled. Traces are the pillar that answers where in a distributed system the time went or the error originated. Their weakness is that they’re heavier to instrument and you usually sample them (keep 1-in-N) to control cost, so a trace is your scalpel for latency and cross-service problems, not your exhaustive ledger.
| Pillar | What it is | Best at answering | Weak at | Typical cost driver |
|---|---|---|---|---|
| Logs | Timestamped event records | Why did this specific thing happen | Trends, aggregate rates, high volume | Ingestion + retention (GB) |
| Metrics | Numbers over time | What and how much, trends, alerting | Per-event detail, root cause | Number of time series (cardinality) |
| Traces | One request across services | Where did the latency/error occur | Completeness (sampled) | Spans ingested |
The reason you need all three: metrics tell you a customer is hurting (failures spiked), traces tell you where (the payment → fraud-API hop), and logs tell you exactly why (connection pool exhausted on host X). Skip one and you’ll feel the gap on your next incident.
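To make the trade-off between logs and metrics concrete, here is a minimal Python sketch. The event records, the field names (`order_id`, `ok`, `latency_s`), and the nearest-rank percentile helper are illustrative assumptions, not any vendor's format. It collapses per-request events into two numbers, roughly the way a metrics pipeline would, and shows which detail is lost along the way.

```python
import math

# Hypothetical per-request events, the kind of detail a log line carries.
events = [
    {"order_id": 8810, "ok": True, "latency_s": 0.4},
    {"order_id": 8811, "ok": True, "latency_s": 0.6},
    {"order_id": 8812, "ok": False, "latency_s": 30.0},
    {"order_id": 8813, "ok": True, "latency_s": 0.5},
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def to_metrics(batch):
    """Aggregate events into metrics; order IDs and failure reasons do not survive."""
    return {
        "orders_failed": sum(1 for e in batch if not e["ok"]),
        "p95_latency_s": percentile([e["latency_s"] for e in batch], 95),
    }


metrics = to_metrics(events)
```

The resulting dictionary says that one order failed and that p95 latency is dominated by a very slow request, but only the original events can say it was order 8812.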
Architecture overview
Here’s how the pieces fit together for our grocery retailer running across a couple of clouds (they’re on AWS for the storefront and Azure for the warehouse/logistics side, which is more common than you’d think and is exactly why a vendor-neutral story matters).
The flow is the same everywhere, and it has four stages: emit → collect → store → use.
- Emit. Every service produces telemetry. Application code writes structured logs (JSON, not plain strings; more on that below), increments metric counters, and, once instrumented with OpenTelemetry (OTel), the vendor-neutral standard for telemetry, emits trace spans. OTel matters enormously to a beginner’s career: you instrument your code once against the OTel API, and you can send that data to CloudWatch, Datadog, or Dynatrace by changing configuration, not code. It’s the thing that keeps you from being locked to one vendor.
- Collect. A lightweight agent or collector runs alongside your workloads (the OTel Collector, the CloudWatch agent, the Azure Monitor Agent, or a vendor agent like the Datadog Agent or Dynatrace OneAgent) and scrapes/receives the three signals, batches them, adds metadata (which host, which container, which Kubernetes pod), and forwards them on. This layer is also where you control cost: drop noisy debug logs, sample traces, filter metrics you’ll never query.
- Store. The signals land in backends suited to each shape. In the cloud-native path, logs and metrics go to Amazon CloudWatch (Logs and Metrics) on the AWS side and Azure Monitor (Log Analytics workspace for logs, Azure Monitor Metrics for metrics) on the Azure side; traces go to AWS X-Ray or Application Insights. On Google Cloud the equivalent is Cloud Monitoring (metrics) plus Cloud Logging and Cloud Trace. In the platform path, everything funnels into one place, Datadog or Dynatrace, which gives you correlated logs, metrics, and traces under a single pane of glass.
- Use. On top of storage sit the four things you actually do with telemetry: dashboards (the shared view of system health), alerting (automatically paging a human when an SLO is at risk), querying/exploration (the ad-hoc “let me ask a new question” that is observability), and automation. For our retailer, a fired alert opens an incident in ServiceNow, the IT service-management platform, so there’s a tracked ticket, an owner, and a paper trail rather than a Slack message that scrolls away.
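The collect stage can be pictured as a small function sitting between your services and storage. The sketch below is a toy stand-in, not the OTel Collector's actual configuration or API; the record shape, the `enrich_and_filter` name, and the metadata fields are hypothetical. It shows the two jobs described above: attaching metadata and dropping noise before it costs money.

```python
def enrich_and_filter(records, host, pod, drop_levels=("debug",)):
    """Toy collector step: drop noisy log levels, stamp every kept record with metadata."""
    kept = []
    for record in records:
        if record.get("signal") == "log" and record.get("level") in drop_levels:
            continue  # cost control: debug logs are never forwarded to the backend
        # Build a new dict so the caller's record is not mutated.
        kept.append({**record, "host": host, "k8s_pod": pod})
    return kept


batch = [
    {"signal": "log", "level": "debug", "msg": "cache hit"},
    {"signal": "log", "level": "error", "msg": "pool exhausted"},
    {"signal": "metric", "name": "orders_failed", "value": 14},
]
forwarded = enrich_and_filter(batch, host="web-3", pod="payment-7f9c")
```

Real collectors do the same work through declarative pipelines, which is why filtering and sampling decisions usually live in configuration rather than in application code.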
The control plane around all of this matters too, even at a junior level. The agents and dashboards are deployed as code: Terraform provisions the CloudWatch alarms, the Azure Monitor action groups, and the Datadog monitors so they’re version-controlled and reviewable, while Ansible (or a container base image) installs and configures the agents consistently on every host. And nobody logs into these tools with a local password. Access is brokered through your identity provider, Okta or Microsoft Entra ID, via single sign-on, so an engineer who leaves the company loses dashboard and alerting access the moment HR disables their account.
Walking one incident through the architecture
Theory sticks better when you apply it to the Tuesday-night checkout incident. Here’s the data flow, pillar by pillar, that turns a 40-minute fishing expedition into a 5-minute fix.
The metric fires first. A dashboard panel — backed by a CloudWatch metric OrdersFailed — crosses its threshold: failures-per-minute jumps from ~0 to 14. Because there’s an alert wired to the SLO (defined below), the on-call’s phone buzzes and a ServiceNow incident opens automatically with the metric snapshot attached. This is monitoring doing its job: a question we asked in advance (“are orders failing more than our budget allows?”) got answered.
The trace localizes it. The engineer opens the trace view (X-Ray, or the trace tab in Datadog) and filters to failed checkout requests. Every failing trace shows the same shape: web frontend → order service (fast) → payment service, where the span sits for 30 seconds and then errors. The order-service and frontend spans are green; the time is all in the payment → external-fraud-API hop. Now she knows where. Notice she did not know to ask “is the fraud API slow?” in advance — the trace let her discover it. That’s observability.
The log explains it. She pivots from the slow span straight to the payment service’s logs for that exact trace ID — because the logs are structured and carry the trace_id, this is one click, not a grep. The log line reads: connection pool exhausted: max 20, waiting. Root cause found: the fraud API got slow, payment-service requests piled up waiting for a connection, the pool of 20 filled, and new checkouts couldn’t even start. The fix (raise the pool, add a timeout and a circuit breaker so a slow dependency fails fast instead of hanging) is now obvious.
That three-step dance — metric to notice, trace to localize, log to explain — is the everyday loop of operating a system well. The whole architecture exists to make those three pivots fast and correlated.
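Underneath, the trace-to-log pivot is just a join on a shared identifier. A minimal sketch, assuming log lines are stored as JSON objects that carry a `trace_id` field (the sample lines, the trace IDs, and the `logs_for_trace` helper are made up for illustration):

```python
import json

# Hypothetical raw log lines as a backend might store them, one JSON object per line.
raw_logs = [
    '{"ts": "19:43:05", "service": "order", "trace_id": "t-41", "msg": "order accepted"}',
    '{"ts": "19:43:07", "service": "payment", "trace_id": "t-42", "msg": "connection pool exhausted: max 20, waiting"}',
    '{"ts": "19:43:08", "service": "payment", "trace_id": "t-43", "msg": "authorized"}',
]


def logs_for_trace(lines, trace_id, service=None):
    """Return parsed log records for one trace, optionally narrowed to one service."""
    records = (json.loads(line) for line in lines)
    return [
        r
        for r in records
        if r.get("trace_id") == trace_id
        and (service is None or r.get("service") == service)
    ]


explaining = logs_for_trace(raw_logs, "t-42", service="payment")
```

With free-text logs and no shared ID, the same lookup degrades into guessing at timestamps and grepping for keywords.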
Where the cloud-native tools fit (and where the platforms do)
A fair question for someone starting out: if AWS, Azure, and Google each ship a built-in observability stack, why does anyone pay for Datadog or Dynatrace? Both choices are legitimate; here’s the honest tradeoff.
The cloud-native tools (CloudWatch, Azure Monitor, Cloud Monitoring) are right there, already integrated, with nothing extra to install for first-party services, and they’re cheap to start. CloudWatch collects metrics for your Lambda functions and load balancers automatically; Azure Monitor already sees your App Service and VMs. For a single-cloud shop, or a team early in its journey, this is genuinely the correct place to begin. The catch is that each is siloed to its own cloud and the cross-signal correlation (jump from this metric to that trace to those logs) ranges from clunky to absent. Our retailer feels this acutely: storefront telemetry in CloudWatch and logistics telemetry in Azure Monitor means no single view, and a checkout problem that spans both clouds is painful to chase.
The observability platforms — Datadog and Dynatrace — exist to solve exactly that. They ingest all three pillars from every environment (both clouds, on-prem, Kubernetes) into one correlated place, so the metric-to-trace-to-log pivot above is a single click across the whole estate. Dynatrace leans into automation: its OneAgent auto-discovers your services and dependencies, and its “Davis” AI proposes a root cause rather than making you hunt. Datadog is famously broad and fast to adopt, with hundreds of turnkey integrations. What you pay for that is, well, the bill — these platforms are priced per host / per GB / per million spans and can get expensive at scale — plus a new vendor relationship.
| Approach | Examples | Strengths | Watch-outs |
|---|---|---|---|
| Cloud-native | CloudWatch, Azure Monitor, Cloud Monitoring | Built-in, cheap to start, zero extra agents for first-party services | Siloed per cloud; weaker cross-signal correlation |
| Observability platform | Datadog, Dynatrace | One correlated view across all clouds; rich tracing; AI-assisted root cause | Cost at scale; another vendor; agent to deploy |
There’s no universally right choice. A pragmatic path many teams take: start on the cloud-native tools, adopt OpenTelemetry for instrumentation from day one so you’re not locked in, and graduate to a platform when multi-cloud correlation or scale makes the manual pivots too slow — which is precisely the bind our grocery retailer is now in.
Your first SLO, and an alert that doesn’t lie
Here’s the single most important habit to build early: alert on customer symptoms, not on machine internals. The Tuesday incident is the proof. The web servers were “green” on CPU and memory the whole time — alerting on those would have stayed silent while checkout burned. What was actually broken was something a customer could feel: orders failing. So that’s what you measure and alert on.
The discipline for this is the SLO (Service Level Objective), borrowed from Google’s SRE practice. Three terms, kept simple:
- An SLI (Service Level Indicator) is the thing you measure — e.g. “the percentage of checkout requests that succeed in under 3 seconds.”
- An SLO is the target for that indicator — e.g. “99.5% of checkouts succeed in under 3 seconds, measured over 28 days.”
- The error budget is what’s left over: 100% − 99.5% = 0.5% of checkouts are allowed to fail. That budget is a feature, not a bug — it tells you when to stop shipping risky changes (budget spent) versus when you have room to move fast (budget healthy).
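The arithmetic is worth doing once by hand. The sketch below assumes a hypothetical volume of 200,000 checkouts in the 28-day window; that figure and the 250 failures are invented for illustration, and only the 99.5% target comes from the example above.

```python
def error_budget(slo_target_pct, total_requests):
    """Number of requests allowed to fail in the window without missing the SLO."""
    allowed_fraction = (100 - slo_target_pct) / 100
    return round(allowed_fraction * total_requests)


def budget_remaining(slo_target_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means the SLO is missed)."""
    budget = error_budget(slo_target_pct, total_requests)
    return (budget - failed_requests) / budget


allowed = error_budget(99.5, 200_000)  # 0.5% of 200,000 checkouts
remaining = budget_remaining(99.5, 200_000, 250)
```

Under these assumptions the budget is 1,000 failed checkouts, and 250 failures so far leaves three quarters of it unspent.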
Why an SLO beats a raw threshold: “alert if any checkout fails” pages you constantly over blips no one noticed; “alert if CPU > 80%” pages you when nothing is actually wrong. An SLO-based alert fires when you’re burning your error budget fast enough to miss the objective — i.e. when customers are genuinely being hurt at a rate that matters. That’s the alert worth a 3 a.m. wake-up; everything else is a dashboard you look at in the morning.
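One common way to turn “burning the budget fast enough to miss the objective” into a rule is a burn-rate check over two time windows, an approach described in Google's SRE material. The sketch below is a simplified version. The threshold of 14.4 is a commonly cited starting value for a one-hour window against a 30-day objective; treat it, and the function names, as assumptions to tune rather than settings taken from this article.

```python
def burn_rate(failed, total, slo_target_pct):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = (100 - slo_target_pct) / 100
    return (failed / total) / allowed_error_ratio


def should_page(long_window, short_window, slo_target_pct=99.5, threshold=14.4):
    """Page only if both windows burn fast: sustained (long) and still happening (short).

    Each window is a (failed, total) pair of request counts.
    """
    return (
        burn_rate(*long_window, slo_target_pct) >= threshold
        and burn_rate(*short_window, slo_target_pct) >= threshold
    )
```

A brief blip fails the long-window test, and an incident that has already recovered fails the short-window test, so neither wakes anyone up.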
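Concretely, in CloudWatch an SLO-based alert is an alarm on a customer-facing metric, not a host metric. The snippet is a simplified sketch of the alarm’s shape rather than a complete template; it pages on a sustained dip below the target: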
```yaml
# CloudWatch alarm: page when checkout success dips below the SLO.
# Math metric: success_rate = successful_orders / total_orders * 100
AlarmName: checkout-slo-burn
Metric: CheckoutSuccessRatePct # derived from OrdersTotal & OrdersSucceeded
ComparisonOperator: LessThanThreshold
Threshold: 99.5 # the SLO target
EvaluationPeriods: 5 # sustained, not a single blip
Period: 60 # 1-minute buckets
TreatMissingData: notBreaching
AlarmActions:
  - arn:aws:sns:...:pager-and-servicenow # page on-call + open a ticket
```
The same idea expresses cleanly in a platform. A Datadog monitor on a trace-derived metric reads almost like the sentence you’d say out loud:
```text
avg(last_5m):
  ( sum:checkout.requests.success{*} / sum:checkout.requests.total{*} ) * 100
  < 99.5
```
Two rules to internalize so your first alerts are good ones. Make alerts actionable — every page should mean “a human must do something now”; if there’s nothing to do, it’s a dashboard, not an alert. And tie the page to an SLO, then route it — the fired alert opens a ServiceNow incident (owner, severity, audit trail) rather than dying in a chat channel. Get these two right and you avoid the career-shortening misery of alert fatigue, where so many false pages fire that the team stops trusting all of them — including the one that mattered.
Practical gotchas worth knowing on day one
A few realities that bite beginners, and the fixes:
Structure your logs. Write JSON, not free-text. {"level":"error","trace_id":"abc","msg":"pool exhausted","pool_max":20} is queryable and joins to traces; "ERROR pool exhausted (max 20)" forces you back to grep. The single highest-leverage habit here is to stamp the trace_id on every log line, because that’s the thread that lets you pivot from a slow span straight to the explaining log — the move that saved our engineer.
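With Python's standard `logging` module, structured output takes only a small custom formatter. This is a minimal sketch, not a production logger. The `JsonFormatter` class and the choice of fields are assumptions, and in a real service the `trace_id` would typically come from the tracing library's current context rather than being passed by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including trace_id when supplied."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
        }
        if hasattr(record, "pool_max"):
            payload["pool_max"] = record.pool_max
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output in our stream only

logger.error("pool exhausted", extra={"trace_id": "abc", "pool_max": 20})
line = stream.getvalue().strip()
```

Every field in `line` is now something a log backend can filter and join on, instead of a substring to search for.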
Cardinality is the metrics cost trap. A metric’s cost scales with the number of unique label combinations (time series), not the number of data points. Tagging a metric with customer_id when you have two million customers silently creates two million series and an enormous bill. Tag with bounded things — region, service, status_code — and keep unbounded identifiers in logs and traces, where they belong.
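The series count is a product, which is why a single unbounded label is enough to blow it up. A back-of-the-envelope sketch follows; the label names and value counts are hypothetical, with only the two million customers taken from the example above, and the product is a worst case because not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"region": 4, "service": 12, "status_code": 6}
with_customer = {**bounded, "customer_id": 2_000_000}

bounded_series = series_count(bounded)  # 4 * 12 * 6
exploded_series = series_count(with_customer)  # the same, times 2,000,000
```

Three bounded labels stay in the hundreds of series; adding `customer_id` pushes the upper bound into the hundreds of millions.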
Sample traces deliberately. Tracing every request at scale is expensive and rarely necessary. Sample (keep 1-in-N, or use tail-based sampling that keeps all the errors and slow requests and a fraction of the boring fast ones) so you pay for the traces you’ll actually look at.
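A tail-based sampling decision can be sketched in a few lines. This is an illustration of the idea, not the configuration of any particular collector. The 3-second slow threshold mirrors the latency in the SLI example, and the 1-in-100 rate and the `keep_trace` name are arbitrary assumptions.

```python
import hashlib


def keep_trace(trace_id, had_error, duration_s, slow_threshold_s=3.0, keep_one_in=100):
    """Tail-sampling decision, made after the trace has finished."""
    if had_error or duration_s >= slow_threshold_s:
        return True  # always keep the interesting traces
    # Hash the trace ID so the decision is stable across processes and restarts.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % keep_one_in
    return bucket == 0
```

Hashing the trace ID instead of calling a random number generator means that every component looking at the same trace reaches the same keep-or-drop decision.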
Security and access aren’t an afterthought. Telemetry contains sensitive operational detail and sometimes accidental PII, so scrub secrets before they hit logs, restrict who can read them, and put dashboards and alerting tools behind SSO via Okta or Entra ID so access follows employment. (Don’t overload these signals for compliance/audit logging either — that’s a separate, tamper-evident pipeline, even though it superficially looks like “just more logs.”)
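Scrubbing is usually easiest to enforce at the point where a message is about to be emitted. The sketch below uses a few deliberately simple regular expressions. The patterns, placeholder strings, and `scrub` helper are illustrative assumptions and will both miss real secrets and catch some harmless values, so treat it as a starting shape rather than a complete filter.

```python
import re

# Illustrative patterns only; a real deployment needs a list tuned to its own data.
_PATTERNS = [
    (re.compile(r"\b\d{13,16}\b"), "[REDACTED_CARD]"),
    (re.compile(r"(?i)\b(password|token|api_key)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def scrub(message):
    """Replace likely secrets and PII in a log message before it is emitted."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message
```

For example, `scrub("login failed password=hunter2 for jo@example.com")` keeps the event readable while removing the credential and the address.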
What to remember
Observability isn’t a tool you buy; it’s a property you build, and it rests on three pillars that each cover the others’ blind spots. Metrics are cheap and tell you what and how much — they drive your dashboards and your alerts. Traces follow one request across services and tell you where the time or the error went. Logs are the detailed record that tells you why a specific thing happened. The everyday loop of running a system well is the pivot between them: a metric to notice, a trace to localize, a log to explain.
For a beginner, the actionable starting point is small and concrete. Instrument with OpenTelemetry so you’re never locked to a vendor. Begin on the cloud-native stack you already have — CloudWatch, Azure Monitor, or Cloud Monitoring — and reach for a platform like Datadog or Dynatrace when correlation across environments outgrows the built-ins, exactly as our grocery retailer’s two-cloud split now demands. Write one SLO on something a customer actually feels, alert only when its error budget is burning, and route that alert into ServiceNow so it becomes a tracked incident, not a lost message. Do those few things and the next time the “Place Order” button spins forever, you’ll be the engineer who closes the incident in five minutes — because you can finally ask your system a question and get an answer.