AWS Observability, In Depth: CloudWatch, CloudTrail, Config & EventBridge

When something goes wrong in AWS at three in the morning, three questions decide how quickly you recover. What is broken right now? — a metric is in alarm, a queue is backing up, latency has tripled. Who changed something? — somebody, or some automation, touched a resource and the timeline matters. When did the configuration drift away from what it should be? — a security group opened, an S3 bucket lost its encryption, an IAM policy widened. AWS answers those three questions with three different services, and the single most useful mental model in all of AWS observability is to keep them straight: CloudWatch tells you what is happening, CloudTrail tells you who did what, and AWS Config tells you what the configuration is and how it changed over time. Tie them together with EventBridge, which turns any of those signals into automated action, and you have a complete observability and governance loop.

This lesson is deliberately exhaustive. Observability is one of the most heavily examined and most operationally important areas of AWS, and it is also where engineers most often have a fuzzy, incomplete picture — they know CloudWatch shows graphs and CloudTrail shows API calls, but cannot explain a composite alarm, a metric filter, a Config conformance pack, or why CloudWatch Events and EventBridge are the same thing under two names. We go through each service with the same treatment used across this course: what it is · the choices · the default · when to use it · the trade-off · the limits · the cost impact · the gotcha. Every core operation comes with a real aws CLI command so you can reproduce it by hand, and because this is a reference you will return to mid-incident, every concept, limit, option and failure mode is also laid out as a scannable table — read the prose once, then keep the tables open when the pager fires.

By the end you will be able to instrument a workload with metrics, alarms, logs and dashboards; query logs at scale with Logs Insights; record an audit trail of every API call with CloudTrail; track and enforce configuration with AWS Config; wire it all into automated remediation with EventBridge; and add distributed tracing with X-Ray. Enough to ace an SOA-C02 or SAA-C03 question, hold your own in an interview, and run a production account you can actually see into.

What problem this solves

Without an observability strategy you are flying blind. The workload runs, until it does not, and when it does not the only signals you have are a customer complaint and a blank stare at the console. The pain is concrete: you cannot tell whether the latency spike is the database or the app; you cannot prove who deleted the security group that took prod down; you cannot say what the IAM policy looked like before it was widened; and you have no automated way to catch a public S3 bucket the moment it is created instead of in next quarter’s audit. Every one of those is a different question, and reaching for the wrong service — grepping CloudWatch for “who deleted this”, or expecting CloudTrail to show you a resource’s state last Tuesday — burns the hour you do not have.

What breaks without it: incidents run long because nobody can localise the fault; security findings surface weeks late; compliance audits become archaeology; and cost quietly balloons because logs default to never expire and high-cardinality custom metrics multiply unwatched. Who hits this: every team running anything in AWS past a single toy instance, and hardest the teams running multi-account organisations where signal is scattered across Regions and accounts with no single pane of glass.

To frame the whole field before the deep dive, here is the question each service answers, the signal it produces, and the first place you look:

| Question in the incident | Service that answers it | Signal it produces | First place to look | The classic mistake |
|---|---|---|---|---|
| What is broken right now? | CloudWatch | Metrics, alarms, logs, dashboards | Alarms list / dashboard for the Region | Looking in the wrong Region (empty graph) |
| Who did what, and when? | CloudTrail | Every API call: identity, action, IP, result | Event history (90-day, free) | Expecting data-plane reads (off by default) |
| What is the config, and how did it change? | AWS Config | Configuration-item timeline + compliance | Resource timeline / rule compliance | Confusing it with CloudTrail’s events |
| How do I react automatically? | EventBridge | Event match → target (Lambda/SSM/SNS) | Rule pattern + target wiring | Pointing AWS events at a custom bus |
| Where did the time go in one request? | X-Ray | Service map + per-request trace | Trace map for the slow operation | Expecting every request (it samples) |

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should already be comfortable with the AWS basics — the Management Console and the aws CLI with a configured profile (covered in AWS Console, CLI, CloudShell & SDK First Steps), Regions and IAM roles/policies (see IAM Fundamentals: Users, Roles, Policies & Evaluation), and at least one service you can generate signal from (an EC2 instance, a Lambda function, or an S3 bucket). No prior monitoring experience is assumed; every term is defined. This is the Observability lesson of the AWS Zero-to-Hero course’s Foundation/Intermediate track, and it is the anchor the operational lessons build on: troubleshooting playbooks, frontend SLO monitoring, and structured logging pipelines all reference the metrics, alarms, logs and trails introduced here.

A quick map of where each piece sits and what depends on it, so you can see the shape before the detail:

| Layer | Service(s) | Scope | Depends on | Built on top of it |
|---|---|---|---|---|
| Telemetry collection | CloudWatch metrics + agent | Per-Region | IAM role on the source | Alarms, dashboards, SLOs |
| Log storage & query | CloudWatch Logs + Insights | Per-Region | Log group + retention | Metric filters, subscriptions |
| Audit of API calls | CloudTrail | Multi-Region capable | S3 bucket, optional KMS | Athena/Lake, security alarms |
| Configuration state | AWS Config | Per-Region (global once) | S3 + recorder | Rules, packs, aggregators |
| Event routing | EventBridge | Per-Region (default bus) | Source events | Remediation, fan-out |
| Distributed tracing | X-Ray | Per-Region, sampled | Instrumentation | Application Signals / SLOs |

Core concepts

Before any console blade, fix five mental models. They explain why these services are shaped the way they are.

Observability is signal plus the ability to ask new questions. Monitoring answers questions you decided to ask in advance (an alarm you pre-wired). Observability is being able to ask questions you did not anticipate — slicing logs by a field you did not pre-aggregate, correlating a latency spike with a deploy, tracing one slow request across five services. CloudWatch metrics and alarms are the monitoring half; Logs Insights, CloudTrail and X-Ray are what give you the observability half.

The three pillars: metrics, logs, traces. A metric is a number over time (CPU %, request count, queue depth) — cheap to store, fast to alarm on, but aggregated, so it tells you that something is wrong, not why. A log is a timestamped record of an event (a line of text or JSON) — rich and detailed, the why, but expensive at volume and slower to query. A trace follows a single request as it hops between services, showing where the time went. CloudWatch covers metrics and logs; X-Ray covers traces; all three live under the CloudWatch umbrella in the console today.

The who / what / when triad. Keep these three audit-and-observe questions separate because three different services answer them:

| Question | Service | What it records | Retention default | Real-time? |
|---|---|---|---|---|
| What is happening / happened? | CloudWatch | Metrics, alarms, logs, dashboards (operational health) | Metrics 15 mo; logs never-expire | Near real time |
| Who did what, and when? | CloudTrail | Every API call: identity, action, source IP, parameters, result | 90-day history; trails as configured | ~5–15 min to S3 |
| What is the config and how did it change? | AWS Config | Resource configuration snapshots + a timeline of changes + compliance | Until you stop the recorder | Minutes after change |

An interviewer’s favourite trap is “I need to know who deleted this security group” (CloudTrail, not CloudWatch) versus “I need to know what my security group looked like last Tuesday and what changed” (AWS Config, not CloudTrail). CloudTrail tells you the event; Config tells you the state over time.

Everything is regional, with a few global exceptions. CloudWatch metrics, alarms, log groups, Config recorders and EventBridge buses are per-Region — a metric published in ap-south-1 does not appear in us-east-1. CloudTrail can be multi-Region (one trail captures all Regions) and some global services (IAM, CloudFront, Route 53) log only to us-east-1. This regionality is the single most common cause of “my alarm/dashboard is empty” — you are looking in the wrong Region.
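
Because alarms are per-Region, a quick way to rule out the wrong-Region mistake is to ask each candidate Region directly. The following is a minimal sketch, assuming a configured CLI profile; the helper name alarms_in_region is hypothetical and exists only for illustration.

```shell
# List the names of alarms currently in ALARM state in one Region.
# alarms_in_region is a hypothetical helper name, not an AWS command.
alarms_in_region() {
  aws cloudwatch describe-alarms \
    --region "$1" \
    --state-value ALARM \
    --query 'MetricAlarms[].AlarmName' \
    --output text
}

# Usage: compare two Regions side by side.
# alarms_in_region ap-south-1
# alarms_in_region us-east-1
```

An empty result in one Region and a populated one in another is usually the whole explanation for an "empty" dashboard.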

Push vs pull, and the agent. AWS services push their own metrics to CloudWatch automatically (EC2 CPU, ELB request count, Lambda invocations) at no charge for the default set. But CloudWatch cannot see inside an EC2 instance — memory and disk usage are not default metrics because the hypervisor cannot see the guest OS. To get those, and to ship the instance’s log files, you install the CloudWatch agent inside the OS. This push model and the agent gap are exam-classic.

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup, but this is the mental model side by side:

| Term | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Metric | A time-ordered series of numeric data points | CloudWatch (per-Region) | The cheap “that something is wrong” signal |
| Namespace | Container grouping related metrics | Metric identity | AWS/ is reserved; yours is anything |
| Dimension | Name/value pair scoping a metric to a resource | Metric identity | Each combo is a distinct billable metric |
| Alarm | A watcher on a metric with states + actions | CloudWatch | Turns a metric into a page or an action |
| Log group / stream | Container / per-source sequence of log events | CloudWatch Logs | Retention & filters live on the group |
| Metric filter | Pattern that turns matching logs into a metric | On a log group | Alarm on “N errors/min” without parsing |
| Trail | Config that delivers CloudTrail events to S3 | CloudTrail | The durable, queryable audit record |
| Management / data event | Control-plane / data-plane API activity | CloudTrail | Data events are off by default and cost |
| Configuration item (CI) | Point-in-time snapshot of a resource | AWS Config | The unit you are billed per, and queried by |
| Config rule | Desired-state check → COMPLIANT/NON_COMPLIANT | AWS Config | Detects drift; does not fix it |
| Conformance pack | Bundle of Config rules + remediation | AWS Config | Deploy a whole standard at once |
| Event bus / rule | The pipe / the matcher routing events to targets | EventBridge | Turns any signal into automation |
| Segment / trace | One service’s work / one request’s full path | X-Ray | Shows which hop is slow |

CloudWatch metrics, in depth

A metric is the fundamental CloudWatch concept: a time-ordered set of data points, each a number with a timestamp, identified by a namespace and zero or more dimensions.

Namespaces. What: a container that groups related metrics so names do not collide. AWS service metrics use the AWS/<service> convention (AWS/EC2, AWS/Lambda, AWS/ApplicationELB, AWS/RDS, AWS/SQS). Your custom metrics go in any namespace you choose (e.g. MyApp/Checkout). Gotcha: the AWS/ prefix is reserved — you cannot publish into it.

The service namespaces you will reach for most, the dimension that scopes them, and a signal worth alarming on in each:

| Namespace | Service | Key dimension | A metric to watch | Why it matters |
|---|---|---|---|---|
| AWS/EC2 | EC2 instances | InstanceId | CPUUtilization, StatusCheckFailed | Health + recover trigger |
| AWS/Lambda | Lambda | FunctionName | Errors, Throttles, Duration | Failures + concurrency limits |
| AWS/ApplicationELB | ALB | LoadBalancer | HTTPCode_Target_5XX_Count, TargetResponseTime | Backend errors + latency SLO |
| AWS/RDS | RDS / Aurora | DBInstanceIdentifier | CPUUtilization, FreeableMemory, DatabaseConnections | DB saturation |
| AWS/SQS | SQS | QueueName | ApproximateAgeOfOldestMessage | Backlog / stuck consumer |
| AWS/DynamoDB | DynamoDB | TableName | ThrottledRequests, ConsumedReadCapacityUnits | Capacity / hot partition |
| AWS/ApiGateway | API Gateway | ApiName | 5XXError, Latency, Count | API health |
| CWAgent | CloudWatch agent | InstanceId, path | mem_used_percent, disk_used_percent | The memory/disk gap |

Dimensions. What: name/value pairs that scope a metric to a specific resource — e.g. the AWS/EC2 CPUUtilization metric carries an InstanceId dimension so each instance has its own line. Choices: up to 30 dimensions per metric; each unique combination of namespace + name + dimensions is a distinct metric (this is what you are billed for as a custom metric). Gotcha: dimensions are part of the metric’s identity — if you publish CPUUtilization with InstanceId=i-abc and also without any dimension, those are two different metrics, and CloudWatch does not auto-aggregate across dimensions for custom metrics.

Statistics. What: how data points in a period are aggregated for display/alarming. Choices: Average, Sum, Minimum, Maximum, SampleCount, and percentiles (p50, p90, p99, or any pNN.NN) and trimmed means. When: use Average for utilisation, Sum for counts (requests, errors), Maximum for “did it ever spike”, and percentiles for latency SLOs (a p99 latency alarm catches the slow tail an average hides). Gotcha: percentiles need raw samples — they do not work on metrics already pre-aggregated as statistic sets unless you publish the full distribution.

Pick the statistic that matches the question, not out of habit:

| Statistic | What it answers | Best for | Trap if misused |
|---|---|---|---|
| Average | Typical value over the period | CPU/memory utilisation | Hides spikes that page you |
| Sum | Total over the period | Request count, errors, bytes | Meaningless for a gauge like CPU% |
| Maximum | Worst point in the period | “Did it ever breach?” heartbeats | One blip looks like sustained load |
| Minimum | Best point in the period | Free-capacity floors | Rarely what you alarm on |
| SampleCount | How many data points landed | Detecting a metric going silent | Not the value, just the count |
| p90 / p99 | The slow tail latency | Latency SLOs, user experience | Needs raw samples, not stat sets |
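
To see the difference a percentile makes, you can pull p99 for a latency metric from the CLI. This is a sketch under assumptions: the helper name p99_latency, the load balancer identifier and the time window are placeholders, and percentiles are requested with --extended-statistics rather than --statistics.

```shell
# Fetch p99 TargetResponseTime for one ALB in 5-minute buckets.
# p99_latency is a hypothetical helper.
# $1 = LoadBalancer dimension value, $2 = start time, $3 = end time (ISO 8601)
p99_latency() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value="$1" \
    --start-time "$2" --end-time "$3" \
    --period 300 \
    --extended-statistics p99
}

# Usage (placeholder load balancer and window):
# p99_latency app/my-alb/0123456789abcdef 2024-01-01T00:00:00Z 2024-01-01T01:00:00Z
```

Running the same call with --statistics Average in place of --extended-statistics p99 shows how much of the slow tail the average hides.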

Resolution. What: how granular the data points are. Choices: standard resolution = 1-minute granularity (the default for AWS service metrics); high resolution = down to 1-second granularity for custom metrics. When: high resolution for fast-moving signals where a one-minute average smooths over a problem (e.g. a spiky request rate, autoscaling on sub-minute bursts). Cost/trade-off: high-resolution alarms can evaluate at 10-second periods but cost more and high-resolution data points cost more to publish. Gotcha: a high-resolution alarm period below 60 s costs more per alarm.

| Resolution | Granularity | Who uses it | Alarm period floor | Cost note |
|---|---|---|---|---|
| Standard (default) | 1 minute | AWS service metrics, most custom | 60 s | Included for default AWS metrics |
| Detailed monitoring (EC2) | 1 minute (vs 5) | EC2 you want finer | 60 s | Per-instance charge; no new metrics |
| High resolution | 1 second | Spiky custom metrics | 10 s | Higher per-metric + per-alarm cost |

Retention (automatic, free, and you cannot change it). CloudWatch keeps metric data at decreasing granularity and discards it after 15 months:

| Original period | Retained as | For |
|---|---|---|
| < 60 s (high-res) | 1-second data points | 3 hours |
| 60 s (1 min) | 1-minute data points | 15 days |
| 5 min | 5-minute data points | 63 days |
| 1 hour | 1-hour data points | 15 months |

Gotcha: after 15 months metrics are gone — if you need longer retention for capacity planning or compliance, export to S3 (via metric streams) or store a copy yourself.

Custom metrics. What: numbers your own application or scripts push to CloudWatch with PutMetricData. When: business and in-app signals CloudWatch cannot see — orders per minute, cache hit ratio, queue processing lag, memory usage. Cost: billed per custom metric per month (per unique namespace+name+dimensions combination), plus per-API-request charges; this adds up fast if you publish a metric per user or per request — use dimensions thoughtfully. Gotcha: PutMetricData accepts timestamps up to two weeks in the past and up to two hours in the future; outside that the point is rejected.

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/Checkout" \
  --metric-name OrdersProcessed \
  --unit Count --value 42 \
  --dimensions Environment=prod,Service=cart

Metric math and search expressions. What: compute new time series from existing ones, such as errors / requests * 100 for an error rate, SUM across instances, or anomaly-detection bands. Search expressions (SEARCH('{AWS/EC2,InstanceId} CPUUtilization', 'Average')) match metrics dynamically so a graph auto-includes new instances. When: fleet-wide dashboards and ratio alarms. Gotcha: you cannot create an alarm directly on a SEARCH expression, because it can return many time series and an alarm watches exactly one; alarms on metric-math expressions that resolve to a single series are supported.

The CloudWatch agent. What: a single binary (amazon-cloudwatch-agent) you install on EC2 or on-premises servers to collect OS-level metrics CloudWatch cannot see (memory, disk space, disk/network I/O, swap, per-process stats) and to ship log files to CloudWatch Logs. Config: a JSON config file (built interactively with amazon-cloudwatch-agent-config-wizard, often stored in SSM Parameter Store) defines which metrics and logs to collect; it can also collect StatsD and collectd custom metrics. Permissions: the instance needs an IAM role with CloudWatchAgentServerPolicy. Gotcha: the old “CloudWatch Logs agent” and per-instance “detailed monitoring” are different things — detailed monitoring just changes EC2 metrics from 5-minute to 1-minute resolution (for a charge); it does not add memory/disk metrics. Only the agent does that.

What is and is not collected without the agent — the table that ends the “why no memory metric?” question forever:

| Signal | Default (no agent)? | Source | How to get it | Gotcha |
|---|---|---|---|---|
| EC2 CPUUtilization | Yes | Hypervisor | Built-in AWS/EC2 | 5-min unless detailed monitoring |
| EC2 network in/out | Yes | Hypervisor | Built-in AWS/EC2 | Bytes, not packets-per-app |
| EC2 disk read/write ops | Yes (EBS-level) | Hypervisor | Built-in AWS/EC2 | Volume I/O, not free space |
| EC2 memory used % | No | Guest OS | CloudWatch agent | Hypervisor can’t see inside |
| EC2 disk free % | No | Guest OS | CloudWatch agent | The one that fills up at 3am |
| EC2 swap / per-process | No | Guest OS | CloudWatch agent | Needs in-OS collection |
| Application log files | No | Guest OS | CloudWatch agent | Or SDK / awslogs driver |
| Lambda invocations/errors | Yes | Service | Built-in AWS/Lambda | Per-function dimensions |
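
Closing the memory/disk gap comes down to a small agent config. The sketch below writes a minimal config that collects memory and root-volume usage and stores it in SSM Parameter Store. The helper names, the file path and the parameter name are illustrative assumptions; the agent still has to be installed and pointed at the parameter separately.

```shell
# Write a minimal agent config: memory used % and root-disk used %.
# $1 = local file path (illustrative).
write_agent_config() {
  cat > "$1" <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF
}

# Store the config in SSM Parameter Store so instances can fetch it.
# $1 = local file path, $2 = parameter name (illustrative).
store_agent_config() {
  aws ssm put-parameter \
    --name "$2" --type String --overwrite \
    --value "file://$1"
}

# Usage:
# write_agent_config /tmp/cwagent.json
# store_agent_config /tmp/cwagent.json AmazonCloudWatch-linux
```

With this config the metrics land in the CWAgent namespace as mem_used_percent and disk_used_percent.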

CloudWatch alarms, in depth

An alarm watches a single metric (or a metric-math expression) and changes state when it breaches a threshold, optionally triggering actions.

The three states. What: OK (within threshold), ALARM (breaching), INSUFFICIENT_DATA (not enough data to decide — e.g. just created, or the metric stopped reporting). Gotcha: INSUFFICIENT_DATA is not failure; how you treat missing data (below) decides whether it becomes ALARM.

| State | Meaning | Common cause | What to wire to it |
|---|---|---|---|
| OK | Within threshold | Healthy | OK action (the all-clear notification) |
| ALARM | Breaching the threshold | The actual problem | SNS page / Auto Scaling / EC2 action |
| INSUFFICIENT_DATA | Not enough data to decide | New alarm, or metric went silent | Treat-missing-data decides next state |

Threshold and comparison. What: the value and operator (GreaterThanThreshold, LessThanOrEqualToThreshold, etc.), or an anomaly-detection band instead of a static number. When: static thresholds for known limits (CPU > 80%); anomaly detection for metrics whose “normal” varies by time of day.

Period, evaluation periods, and datapoints to alarm. What: the period is the length of each data point (e.g. 60 s); evaluation periods is how many recent periods are considered; datapoints to alarm is how many of those must breach. Example: period 60 s, evaluation periods 5, datapoints to alarm 3 → “alarm if 3 of the last 5 minutes breach” — the M-out-of-N pattern that suppresses single-spike flapping. Default: datapoints = evaluation periods (all must breach). Gotcha: setting evaluation periods to 1 makes the alarm twitchy; M-out-of-N is the production-grade choice.

These three knobs cause more bad pages than anything else; here is what each does and how to set it:

| Parameter | What it controls | CLI flag | Typical value | If you get it wrong |
|---|---|---|---|---|
| Period | Length of one data point | --period | 60 s | Too short = noisy; too long = slow |
| Evaluation periods (N) | How many recent periods to weigh | --evaluation-periods | 5 | 1 = twitchy, flaps on a blip |
| Datapoints to alarm (M) | How many of N must breach | --datapoints-to-alarm | 3 | = N means a single good point clears it |
| Comparison operator | Direction of the breach | --comparison-operator | GreaterThanThreshold | Wrong direction = never fires |
| Threshold | The breach value | --threshold | workload-specific | Set from p99 baseline, not a guess |

Missing data treatment. What: what to do when a period has no data. Choices: missing (default — treat as neither breaching nor OK), notBreaching (treat as OK), breaching (treat as ALARM), ignore (keep the current state). When: breaching for “this thing must always report” (a heartbeat); notBreaching to avoid false alarms on metrics that legitimately go quiet. Gotcha: a stopped EC2 instance stops publishing CPUUtilization, so an alarm on it may sit in INSUFFICIENT_DATA forever unless you set the treatment deliberately.

| --treat-missing-data | Missing period is treated as | Use when | Example |
|---|---|---|---|
| missing (default) | Neither breach nor OK | You genuinely do not know | Default; often not what you want |
| notBreaching | OK | Metric legitimately goes quiet | Nightly-idle batch worker |
| breaching | ALARM | The thing must always report | Heartbeat / liveness metric |
| ignore | Keep current state | Avoid flip-flop on gaps | Sparse business metric |
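
The heartbeat case is where the treatment matters most: silence itself should page. A minimal sketch, assuming your worker publishes a custom Heartbeat metric once a minute; the MyApp/Worker namespace, the metric name and the helper name heartbeat_alarm are all hypothetical.

```shell
# Alarm when a heartbeat metric is absent or zero for 3 consecutive minutes.
# $1 = alarm name, $2 = SNS topic ARN for notifications.
heartbeat_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --namespace MyApp/Worker --metric-name Heartbeat \
    --statistic Sum --period 60 \
    --evaluation-periods 3 --datapoints-to-alarm 3 \
    --threshold 1 --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$2"
}

# Usage:
# heartbeat_alarm worker-heartbeat arn:aws:sns:ap-south-1:111122223333:ops-alerts
```

With breaching, three silent minutes count as three breaching data points, so a dead worker moves the alarm to ALARM instead of leaving it in INSUFFICIENT_DATA.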

Alarm actions. What: what happens on state change. Choices: publish to an SNS topic (email/SMS/Lambda/chat), trigger EC2 Auto Scaling policies, perform EC2 actions (stop/terminate/reboot/recover), create OpsItems/incidents in Systems Manager. When: SNS for notification and fan-out; Auto Scaling for elastic capacity; EC2 recover for automatic recovery of an impaired instance onto new hardware. Gotcha: you can set different actions for entering ALARM, OK, and INSUFFICIENT_DATA states — wire an OK action so you get the “all clear” too.

| Action type | What it does | State it usually fires on | Limitation |
|---|---|---|---|
| SNS publish | Email/SMS/Lambda/chat fan-out | ALARM and OK | Cost negligible; the default choice |
| EC2 Auto Scaling policy | Add/remove instances | ALARM (scale-out) | Needs an ASG and a scaling policy |
| EC2 action (recover) | Move impaired instance to new HW | ALARM | Only certain instance/EBS configs |
| EC2 action (stop/terminate/reboot) | Lifecycle action | ALARM | Dangerous; scope IAM tightly |
| SSM OpsItem / incident | Open an operational ticket | ALARM | Needs Systems Manager set up |

Composite alarms. What: an alarm whose state is a boolean expression over other alarms, for example ALARM("HighCPU") AND ALARM("HighLatency"), or (A OR B) AND NOT C. When: to reduce alarm noise by paging only when several signals agree (a real outage) rather than on every individual flap, and to model dependencies (“don’t page on the app alarm if the database alarm is already firing”). Suppression: an actions-suppressor alarm mutes the composite alarm’s own actions while the suppressor is in ALARM, which is how you silence pages during a known upstream outage or a maintenance window. Gotcha: composite alarms cannot perform EC2 or Auto Scaling actions, only notifications via SNS, because they have no single underlying metric.

# Alarm: CPU > 80% for 3 of the last 5 minutes, notify an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name ec2-high-cpu \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average --period 60 \
  --evaluation-periods 5 --datapoints-to-alarm 3 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:ap-south-1:111122223333:ops-alerts

A metric alarm and a composite alarm are not interchangeable — know which you need:

| Property | Metric alarm | Composite alarm |
|---|---|---|
| Watches | One metric / metric-math | A boolean expression over other alarms |
| Purpose | Detect one breach | Reduce noise; model dependencies |
| Can do EC2/ASG actions | Yes | No (notifications only) |
| Actions can be muted by a suppressor alarm | n/a | Yes (actions-suppressor) |
| Typical use | “CPU > 80% for 3 of 5” | “page only if CPU AND latency breach” |
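
A composite alarm is created with its own command and takes a rule string instead of a metric. The sketch below assumes two metric alarms already exist: ec2-high-cpu from the example above and a hypothetical alb-high-latency. The composite name app-degraded and the helper name composite_alarm are also illustrative.

```shell
# Page only when both child alarms are in ALARM at the same time.
# $1 = SNS topic ARN for notifications.
composite_alarm() {
  aws cloudwatch put-composite-alarm \
    --alarm-name app-degraded \
    --alarm-rule 'ALARM("ec2-high-cpu") AND ALARM("alb-high-latency")' \
    --alarm-actions "$1"
}

# Usage:
# composite_alarm arn:aws:sns:ap-south-1:111122223333:ops-alerts
```

A common arrangement is to leave the child alarms without paging actions and attach the SNS topic only to the composite, so on-call hears about the combination rather than each flap.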

CloudWatch Logs, in depth

CloudWatch Logs is the managed store for log data — from Lambda, the CloudWatch agent, ECS/EKS, API Gateway, VPC Flow Logs, Route 53, and your own apps.

Log groups and log streams. What: a log group is the top-level container (one per application/component, e.g. /aws/lambda/checkout); a log stream is a sequence of log events from a single source within that group (one stream per Lambda execution environment, per EC2 instance, per container). Gotcha: retention, encryption, metric filters and subscription filters are set on the group, not the stream.

| Concept | Granularity | Set on it | Example |
|---|---|---|---|
| Log group | One per app/component | Retention, KMS, filters | /aws/lambda/checkout |
| Log stream | One per source instance | Nothing configurable | one per Lambda env / EC2 host |
| Log event | One record | | {"level":"ERROR","status":500} |

Retention. What: how long events are kept before automatic deletion. Choices: 1 day up to 10 years, or Never expire (the default — and a classic cost trap). When: set a deliberate retention on every group; debug logs 7–30 days, audit logs longer. Gotcha: the default Never expire means logs accumulate and you pay storage forever — always set retention explicitly. New log groups created by some services still default to never-expire.

| Log type | Suggested retention | Why | Cost lever |
|---|---|---|---|
| App debug / verbose | 7–14 days | Useful only while fresh | Biggest storage saver |
| Access / request logs | 30–90 days | Trend + incident lookback | Infrequent-Access log class |
| Security / audit | 1–7 years (or longer) | Compliance, forensics | Export to S3 / Glacier |
| Default if you do nothing | Never expire | The trap | Always override it |
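
Finding and fixing never-expire groups takes two commands. A minimal sketch, assuming the current Region and profile; the helper names are hypothetical, and the retention value must be one of the periods CloudWatch Logs accepts (14 is among them).

```shell
# List log groups that have no retention set (they never expire).
groups_without_retention() {
  aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' \
    --output text
}

# Set an explicit retention on one group.
# $1 = log group name, $2 = retention in days.
set_retention() {
  aws logs put-retention-policy \
    --log-group-name "$1" \
    --retention-in-days "$2"
}

# Usage:
# groups_without_retention
# set_retention /aws/lambda/checkout 14
```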

Encryption. What: log data is encrypted at rest by default; you can associate a KMS key for customer-managed encryption per log group.

Metric filters. What: a pattern that scans incoming log events and increments a CloudWatch metric when it matches — turning unstructured logs into a number you can alarm on. Example: count occurrences of ERROR or "statusCode": 500, or extract a numeric field (latency) from JSON and publish it. When: alarm on “more than N errors per minute” without parsing logs in real time yourself. Gotcha: metric filters only apply to new events after the filter is created — they do not back-fill existing logs; and the metric only emits data points when matches occur (mind missing-data treatment on the alarm). The classic exam example is the CIS-benchmark metric filters on CloudTrail logs that alarm on root-account usage or unauthorised API calls.
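
As a concrete instance of the pattern, the sketch below counts JSON log events whose status field is 500. It assumes structured JSON logs with a numeric status field; the filter name, metric name, MyApp/Checkout namespace and the helper name count_500s_filter are illustrative.

```shell
# Turn every JSON log event with status 500 into a count metric.
# $1 = log group name.
count_500s_filter() {
  aws logs put-metric-filter \
    --log-group-name "$1" \
    --filter-name http-500-count \
    --filter-pattern '{ $.status = 500 }' \
    --metric-transformations \
      metricName=Http500Count,metricNamespace=MyApp/Checkout,metricValue=1,defaultValue=0
}

# Usage:
# count_500s_filter /aws/lambda/checkout
```

Setting defaultValue=0 makes the filter report zero for periods in which log events arrive but none match, which softens the missing-data problem on the alarm; periods with no log events at all still produce no data point.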

Subscription filters. What: stream matching log events in near real time to a destination — Kinesis Data Streams, Firehose (→ S3/OpenSearch/Redshift), or Lambda. When: central log aggregation, real-time processing, or shipping to a SIEM/OpenSearch. Limit: historically one subscription filter per log group; account-level subscription filters and up to two filters per group are now supported. Gotcha: this is the standard path for the structured-logging pipeline pattern — see Structured Logging Pipeline on AWS: CloudWatch → Firehose → OpenSearch for the Firehose-to-OpenSearch build.

A metric filter and a subscription filter sound alike and do opposite jobs:

| Filter type | Output | Destination | Use it to | Cost driver |
|---|---|---|---|---|
| Metric filter | A CloudWatch metric (a number) | CloudWatch metrics | Alarm on a log pattern | Per metric, near-free |
| Subscription filter | The matching log events themselves | Kinesis / Firehose / Lambda | Ship/aggregate logs elsewhere | Per GB delivered/processed |

Logs Insights. What: an interactive, purpose-built query language to search and analyse log data across log groups without exporting it. Capabilities: fields, filter, parse (extract fields from text), stats (aggregate — count, avg, percentiles), sort, limit, and bin() for time-bucketing; auto-discovers fields in JSON logs. When: ad-hoc investigation — “show me the 20 slowest requests in the last hour”, “count 5xx by path”, “which user-agents hit this endpoint”. Cost: billed by the amount of data scanned per query, so narrow the time range and log groups. Gotcha: it queries, it does not alter; results can be saved and added to dashboards.

| Logs Insights command | What it does | Example |
|---|---|---|
| fields | Select/derive fields to show | fields @timestamp, status, duration |
| filter | Keep matching rows | filter status = 500 |
| parse | Extract fields from text | parse @message "user=*;" as user |
| stats | Aggregate (count/avg/pct) | stats count(*) by bin(5m) |
| sort / limit | Order and cap results | sort duration desc |

# Logs Insights: the 10 slowest requests that returned HTTP 500, from a JSON log
fields @timestamp, @message, duration
| filter status = 500
| sort duration desc
| limit 10

# Count errors per 5-minute bucket
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
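
The same queries can be run from the CLI, which is handy mid-incident or in a script. Queries are asynchronous: one call starts the query and returns an ID, a second call fetches results. A minimal sketch with hypothetical helper names; note that the CLI's --query option here is the output filter, not the Insights query itself.

```shell
# Start the errors-per-5-minutes query and print its query ID.
# $1 = log group, $2 = start time, $3 = end time (epoch seconds).
run_insights_query() {
  aws logs start-query \
    --log-group-name "$1" \
    --start-time "$2" --end-time "$3" \
    --query-string 'filter @message like /ERROR/ | stats count(*) as errors by bin(5m)' \
    --query queryId --output text
}

# Fetch results for a query ID; repeat until the status is Complete.
fetch_insights_results() {
  aws logs get-query-results --query-id "$1"
}

# Usage (last hour of a placeholder log group):
# qid=$(run_insights_query /aws/lambda/checkout "$(( $(date +%s) - 3600 ))" "$(date +%s)")
# sleep 5; fetch_insights_results "$qid"
```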

Live Tail and other features. What: Live Tail streams matching log events in real time in the console (great during a deploy); log class offers a cheaper Infrequent Access tier for logs you rarely query; export to S3 for long-term archival; Logs anomaly detection flags unusual patterns automatically.

| Feature | What it gives you | When to reach for it |
|---|---|---|
| Live Tail | Real-time stream in the console | Watching a deploy or a live incident |
| Infrequent Access log class | Cheaper storage, limited features | Logs you rarely query but must keep |
| Export to S3 | Bulk archival to cheap storage | Long-term retention / Athena |
| Logs anomaly detection | Auto-flags unusual log patterns | Catching novel errors you didn’t pre-filter |
| Data Protection | Masks sensitive data (PII) inline | Logs that may contain emails/cards |
| Embedded Metric Format (EMF) | Emit metrics from a structured log line | High-cardinality app metrics without PutMetricData |

CloudWatch dashboards & alarms-at-a-glance

A dashboard is a customisable page of widgets (line/stacked-area/number/gauge/bar graphs, alarm-status widgets, logs-table widgets, text, and custom widgets backed by Lambda).

| Widget type | Shows | Best for |
|---|---|---|
| Line / stacked-area | Metric trends over time | Latency, request rate, utilisation |
| Number / gauge | A single current value | SLO at-a-glance, error budget |
| Alarm status | State of one or many alarms | On-call “is anything red?” panel |
| Logs table (Insights) | Rows from a saved query | Recent errors inline on the board |
| Text / custom (Lambda) | Markdown / arbitrary render | Runbook links, bespoke visuals |
| Bar / pie | Categorical comparison | Errors by service, cost by tag |
| Explorer | Auto-grouped resource graphs by tag | Fleet view without hand-built widgets |

CloudTrail, in depth — the “who did what”

CloudTrail records API activity in your account — who called which AWS API, when, from where, with what parameters, and whether it succeeded. It is your security and audit backbone, completely separate from CloudWatch’s operational metrics.

Event history (always on, free, 90 days). What: CloudTrail automatically keeps a 90-day, searchable history of management events in every Region with no setup and no charge. When: quick “who deleted this / who changed that” investigations. Limit: 90 days only, management events only, viewable/queryable but not delivered anywhere. Gotcha: for anything beyond 90 days, for data events, or for delivery to S3, you must create a trail.
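
Event history can be searched from the CLI without creating a trail. The sketch below looks up recent calls by API name in the current Region; the helper name who_called is hypothetical.

```shell
# Show who made the most recent calls to a given API, from the 90-day event history.
# $1 = API event name, e.g. DeleteSecurityGroup.
who_called() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue="$1" \
    --max-results 10 \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output table
}

# Usage:
# who_called DeleteSecurityGroup
```

Like everything else in this lesson, the lookup is per-Region, so run it in the Region where the resource lived.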

Trails. What: a configuration that delivers events to an S3 bucket (and optionally CloudWatch Logs and EventBridge) for long-term retention and analysis. Choices: single-Region vs multi-Region (multi-Region is the recommended default — one trail captures all current and future Regions); organisation trail (created in the management account, captures every account in the AWS Organization, member accounts cannot disable it). Gotcha: global-service events (IAM, STS, CloudFront, Route 53) are logged via us-east-1 — if your trail is single-Region elsewhere you will miss them; multi-Region trails capture them correctly.

Trail choice What it captures When to use Gotcha
Single-Region One Region’s events Rarely; isolated test Misses global-service events
Multi-Region All current + future Regions The recommended default Slightly more S3 volume
Organisation trail Every account in the org Multi-account governance Members cannot disable it
With CloudWatch Logs Events also to a log group Metric-filter security alarms Extra ingestion cost
With log-file validation Hash-chained digest files Compliance / forensics Off by default; enable it

The three event categories:

Event type What it captures Default Cost note
Management events Control-plane operations — RunInstances, CreateBucket, AttachRolePolicy, console sign-in, AssumeRole Logged by default (first copy of management events to a trail is free) One free trail copy; additional trails charged per event
Data events High-volume data-plane operations — S3 object GetObject/PutObject, Lambda Invoke, DynamoDB item ops Off by default (must opt in, can be very high volume) Charged per data event delivered
Insights events Detected unusual activity in management or data event volume (e.g. a spike in DeleteBucket or errors) Off by default (opt in) Charged per event analysed for Insights

Gotcha (the exam favourite): “I enabled CloudTrail but I cannot see who read this S3 object.” Reads are data events and are off by default — management events do not include object-level S3/Lambda activity. You must enable S3 data events on the trail (and they cost money at scale, so scope them to the buckets that matter).

Read/write filter. What: you can log only Read, only Write, or All events per category — narrowing to Write cuts noise and cost while keeping the changes that matter for audit.
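As a minimal sketch of the two points above (scoping data events and filtering to reads), the advanced event selectors below log object-level reads on a single bucket. The trail name org-audit-trail is reused from this lesson's CLI example and the bucket name my-sensitive-bucket is hypothetical. The AWS call is wrapped in a function so that pasting the block changes nothing until you run it.

```shell
# Hypothetical names: adjust to your own trail and bucket.
TRAIL=org-audit-trail
BUCKET=my-sensitive-bucket

# put-event-selectors replaces the trail's existing selectors, so management
# events are listed explicitly alongside the scoped S3 data events.
SELECTORS=$(cat <<EOF
[
  {
    "Name": "Keep all management events",
    "FieldSelectors": [
      {"Field": "eventCategory", "Equals": ["Management"]}
    ]
  },
  {
    "Name": "S3 object reads on ${BUCKET}",
    "FieldSelectors": [
      {"Field": "eventCategory",  "Equals": ["Data"]},
      {"Field": "resources.type", "Equals": ["AWS::S3::Object"]},
      {"Field": "readOnly",       "Equals": ["true"]},
      {"Field": "resources.ARN",  "StartsWith": ["arn:aws:s3:::${BUCKET}/"]}
    ]
  }
]
EOF
)

# Apply the selectors to the trail.
enable_s3_read_events() {
  aws cloudtrail put-event-selectors \
    --trail-name "$TRAIL" \
    --advanced-event-selectors "$SELECTORS"
}
```

Run enable_s3_read_events once the JSON looks right; dropping the readOnly selector would log writes as well, at a correspondingly higher event volume.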

Log-file integrity validation. What: CloudTrail can produce digest files (hash-chained, signed) so you can prove logs were not tampered with after delivery — aws cloudtrail validate-logs. When: compliance and forensics. Gotcha: you must enable it on the trail; it is not on by default.
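A short sketch of running the validation, assuming a trail named org-audit-trail (this lesson's example name) that already has log-file validation enabled. The seven-day window is an arbitrary choice and the date call uses GNU date syntax, as found in CloudShell.

```shell
# Validate the last seven days of delivered log files against their digests.
validate_last_week() {
  local trail_arn start
  # Look up the trail ARN from its name (org-audit-trail is an assumed name).
  trail_arn=$(aws cloudtrail describe-trails --trail-name-list org-audit-trail \
    --query 'trailList[0].TrailARN' --output text)
  # GNU date syntax; on macOS/BSD use: date -u -v-7d +%Y-%m-%dT%H:%M:%SZ
  start=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
  aws cloudtrail validate-logs \
    --trail-arn "$trail_arn" --start-time "$start" --verbose
}
```

Call validate_last_week from a principal that can read the log bucket; the command reports on each digest and log file it checks.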

Where the logs go and how you query them. Delivered as gzipped JSON to S3 (partitioned by account/Region/date). Query options: Athena (point-and-click table creation from the console), send to CloudWatch Logs for metric filters/alarms (e.g. alarm on root login), or use CloudTrail Lake, a managed, SQL-queryable event data store with its own retention (up to years) that removes the S3+Athena plumbing. Gotcha: delivery to S3 is near real time but not instant (typically within ~15 minutes). CloudTrail is for audit, not low-latency alerting; for real-time reaction, route CloudTrail events through EventBridge.

Query path What it is Latency Best for
Event history Built-in 90-day console search Seconds Quick “who did this” lookups
Athena over S3 SQL on the delivered JSON Minutes (after ~15 min delivery) Ad-hoc forensics, joins
CloudWatch Logs + metric filter Trail → log group → alarm Near real time on the metric Security alarms (root login)
CloudTrail Lake Managed SQL event store Minutes Long retention, no S3 plumbing
EventBridge Trail event → rule → target Near real time Automated reaction, not just audit
# Look up recent console sign-ins from the always-on Event history
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin \
  --max-results 10

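# Create a multi-Region trail with log-file validation.
# The S3 bucket must already exist with a bucket policy that lets CloudTrail write to it.
# Despite the name, this is a single-account trail; add --is-organization-trail
# (from the management account) to make it an organisation trail.
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-cloudtrail-logs-111122223333 \
  --is-multi-region-trail --enable-log-file-validation
# A trail created from the CLI records nothing until logging is started
aws cloudtrail start-logging --name org-audit-trail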
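The "CloudWatch Logs + metric filter" query path from the table above can be sketched as follows. It assumes the trail already delivers to a log group; the log group name, metric namespace and alarm name are hypothetical, and the filter pattern is the commonly used root-account-usage pattern from the CIS benchmark set.

```shell
# Hypothetical log group that the trail delivers to.
LOG_GROUP=CloudTrail/audit

# Matches API activity by the root user, excluding AWS service events.
ROOT_PATTERN='{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }'

# Usage: create_root_login_alarm <sns-topic-arn>
create_root_login_alarm() {
  local topic_arn="$1"
  # 1. Turn each matching log event into a data point of value 1.
  aws logs put-metric-filter \
    --log-group-name "$LOG_GROUP" \
    --filter-name root-account-usage \
    --filter-pattern "$ROOT_PATTERN" \
    --metric-transformations \
      metricName=RootAccountUsage,metricNamespace=SecurityAudit,metricValue=1
  # 2. Alarm as soon as one such event is seen in a 5-minute period.
  aws cloudwatch put-metric-alarm \
    --alarm-name root-account-usage \
    --namespace SecurityAudit --metric-name RootAccountUsage \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic_arn"
}
```

Because the metric only exists when the pattern matches, notBreaching keeps the alarm in OK rather than INSUFFICIENT_DATA during quiet periods.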

AWS Config, in depth — the “what is it, and how did it change”

AWS Config continuously records the configuration of your resources and keeps a timeline of every change, then evaluates that configuration against rules. Where CloudTrail records the event (“someone called AuthorizeSecurityGroupIngress”), Config records the resulting state (“this security group now allows 0.0.0.0/0 on port 22, and here is exactly what it looked like before and after, with the CloudTrail event that caused it”).

The configuration recorder & configuration items. What: the recorder captures, for each supported resource, a configuration item (CI) — a point-in-time snapshot of the resource’s attributes, relationships (this EC2 instance → this ENI → this security group), tags, and a link to the CloudTrail event that triggered the change. Choices: record all supported resource types (recommended) or a selected list; record global resources (IAM) in one Region to avoid duplication. Cost: charged per configuration item recorded and per rule evaluation, so high-churn resources cost more. Gotcha: Config must be turned on per Region and needs an S3 bucket for the configuration snapshots/history and (optionally) an SNS topic for change notifications.
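A minimal sketch of turning the recorder on in one Region, assuming the Config service-linked role already exists and that the S3 bucket (a hypothetical name here) has a policy allowing Config to write to it. Repeat per Region, and consider recording global resources in only one of them.

```shell
# Assumed values: the service-linked role ARN pattern and a hypothetical bucket.
ROLE_ARN=arn:aws:iam::111122223333:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig
CONFIG_BUCKET=my-config-history-111122223333

enable_config_recorder() {
  # Record every supported resource type, including global (IAM) resources.
  aws configservice put-configuration-recorder \
    --configuration-recorder "name=default,roleARN=$ROLE_ARN" \
    --recording-group allSupported=true,includeGlobalResourceTypes=true
  # Tell Config where to deliver snapshots and history.
  aws configservice put-delivery-channel \
    --delivery-channel "name=default,s3BucketName=$CONFIG_BUCKET"
  # Nothing is recorded until the recorder is started.
  aws configservice start-configuration-recorder \
    --configuration-recorder-name default
}
```

After running enable_config_recorder, aws configservice describe-configuration-recorder-status should report that recording is on.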

Configuration history & snapshots. What: the full timeline lets you answer “what did this resource look like at 14:00 last Tuesday?” and “show me every change to this bucket policy this month”. Delivered to S3; queryable.
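For a quick look at that timeline from the CLI, a sketch along these lines lists the most recent configuration items for one security group. The function name is illustrative and the security group ID is whatever you pass in.

```shell
# Usage: sg_history sg-0123456789abcdef0
sg_history() {
  local sg_id="$1"
  # Newest configuration items first; each row is one recorded change.
  aws configservice get-resource-config-history \
    --resource-type AWS::EC2::SecurityGroup \
    --resource-id "$sg_id" \
    --limit 5 \
    --query 'configurationItems[].[configurationItemCaptureTime,configurationItemStatus]' \
    --output table
}
```

Adding --later-time with a timestamp narrows the result to the state at or before that moment, which is how the "14:00 last Tuesday" question is answered.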

Config rules. What: desired-state checks that mark each resource COMPLIANT or NON_COMPLIANT. Choices: AWS managed rules (hundreds pre-built — s3-bucket-public-read-prohibited, encrypted-volumes, restricted-ssh, iam-password-policy) or custom rules backed by Lambda or Guard (policy-as-code). Trigger types: configuration-change-triggered (evaluate when a resource changes) or periodic (evaluate on a schedule). Gotcha: a rule reports compliance; it does not fix anything by itself.

Rule trigger When it evaluates Best for Gotcha
Configuration change The moment a resource changes Catch drift immediately Needs the recorder on for that type
Periodic On a fixed schedule (e.g. 24h) Account-wide posture checks Up to a period of lag
AWS managed rule Pre-built logic, parameterised 90% of needs Know its parameters/limits
Custom (Lambda / Guard) Your code / policy-as-code Bespoke standards You own the logic and its bugs

Remediation. What: attach an SSM Automation document to a rule to auto-remediate non-compliant resources (e.g. re-enable bucket encryption, remove an open ingress rule) — automatic or on-approval. Gotcha: test remediation in non-prod; an over-eager auto-remediation can fight a legitimate change.
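A hedged sketch of attaching automatic remediation to a rule. It assumes a rule deployed under the name restricted-ssh, uses the AWS-owned runbook AWS-DisablePublicAccessForSecurityGroup, and references a hypothetical IAM role that SSM Automation can assume; check the runbook's parameters in your account before relying on it.

```shell
# Hypothetical automation role; the rule name assumes restricted-ssh is deployed.
REMEDIATION=$(cat <<'EOF'
[{
  "ConfigRuleName": "restricted-ssh",
  "TargetType": "SSM_DOCUMENT",
  "TargetId": "AWS-DisablePublicAccessForSecurityGroup",
  "Parameters": {
    "AutomationAssumeRole": {
      "StaticValue": {"Values": ["arn:aws:iam::111122223333:role/config-remediation-role"]}
    },
    "GroupId": {"ResourceValue": {"Value": "RESOURCE_ID"}}
  },
  "Automatic": true,
  "MaximumAutomaticAttempts": 3,
  "RetryAttemptSeconds": 60
}]
EOF
)

attach_remediation() {
  aws configservice put-remediation-configurations \
    --remediation-configurations "$REMEDIATION"
}
```

Setting "Automatic" to false keeps the same wiring but leaves the runbook to be triggered manually, which is the safer starting point outside non-prod.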

Conformance packs. What: a collection of Config rules + remediation packaged as a single deployable unit (a YAML template), e.g. the Operational Best Practices for PCI DSS sample pack, or your own internal baseline. When: deploy a whole compliance standard at once, and across an entire AWS Organization with one action. Gotcha: a conformance pack creates its own resources and has its own cost (per rule evaluation); deleting the pack removes its rules.

Aggregators. What: a multi-account, multi-Region view that rolls compliance and configuration data from many accounts into one dashboard — essential at organisation scale.
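As an illustration, an organisation-wide aggregator could be created and queried roughly like this. The aggregator name and the IAM role ARN are hypothetical, and the role must allow Config to read organisation details.

```shell
# Hypothetical role that lets Config list the organisation's accounts.
AGG_ROLE_ARN=arn:aws:iam::111122223333:role/config-org-aggregator-role

create_org_aggregator() {
  aws configservice put-configuration-aggregator \
    --configuration-aggregator-name org-aggregator \
    --organization-aggregation-source "RoleArn=$AGG_ROLE_ARN,AllAwsRegions=true"
}

# One row per account, Region and rule with its compliance state.
org_compliance() {
  aws configservice describe-aggregate-compliance-by-config-rules \
    --configuration-aggregator-name org-aggregator \
    --query 'AggregateComplianceByConfigRules[].[AccountId,AwsRegion,ConfigRuleName,Compliance.ComplianceType]' \
    --output table
}
```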

CloudTrail and Config are constantly confused; this is the line that separates them:

Dimension CloudTrail AWS Config
Records The API call (the event) The resulting state + its history
Answers “Who called what, when, from where?” “What did it look like, and is it compliant?”
Unit An event record A configuration item (CI)
Evaluates compliance No Yes (rules / packs)
Can remediate No (route via EventBridge) Yes (SSM Automation)
Billed per Event (mgmt free first copy) CI recorded + rule evaluation
# Deploy a managed rule (the recorder and delivery channel must already be set up in this Region)
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": { "Owner": "AWS", "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED" }
}'

# Check compliance
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited

EventBridge, in depth — turning signals into automation

Amazon EventBridge is the serverless event bus that connects AWS service events, your own application events, and SaaS events to targets — the glue that turns observability signals into automated action. It is the evolution of CloudWatch Events: the two are the same underlying service, the APIs are compatible, and the console moved CloudWatch Events under the EventBridge name. If a question mentions “CloudWatch Events”, read it as EventBridge.

Event buses. What: the pipe events flow through. Choices: the default bus (receives events from AWS services automatically), custom buses (for your own application events, isolating domains), and partner/SaaS buses (events from integrated SaaS providers). Gotcha: AWS service events land on the default bus only — you cannot point them at a custom bus directly.

Bus type Receives Use it for Gotcha
Default bus AWS service events automatically Reacting to AWS events AWS events land here only
Custom bus Your own PutEvents events Isolating app domains You publish to it explicitly
Partner/SaaS bus Integrated SaaS provider events Zendesk/Datadog/etc. triggers Requires the partner integration

Rules and event patterns. What: a rule matches events with an event pattern (JSON matching on fields: source, detail-type, and any nested detail field) and routes matches to up to 5 targets. Example pattern: match every EC2 instance that enters stopped, or every CloudTrail-delivered DeleteBucket, or every Config NON_COMPLIANT finding. Alternative: a scheduled rule (cron/rate expression) for time-based triggers, the serverless replacement for cron. Gotcha: content-based filtering happens on the bus before delivery, so targets receive and process only matching events; custom events are still billed when they are published, whether or not a rule matches them.

Common event patterns you will write, and what each catches — the detail shape is service-specific, so always validate against a real sample:

Goal source Match in detail / detail-type Typical target
EC2 instance stopped aws.ec2 detail.state = stopped SNS / Lambda
Config resource non-compliant aws.config detail.newEvaluationResult.complianceType = NON_COMPLIANT SSM Automation
CloudTrail DeleteBucket aws.s3 (via CloudTrail) detail.eventName = DeleteBucket SNS alert
GuardDuty finding aws.guardduty detail.severity >= 7 Lambda / SNS
Auto Scaling launch failed aws.autoscaling detail-type = EC2 Instance Launch Unsuccessful SNS
Scheduled (cron) n/a (schedule expression, not a pattern) rate(5 minutes) / cron(...) Lambda batch
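Since the detail shape is service-specific, it helps to check a pattern against a sample before creating the rule. The sketch below does that for the Config non-compliance row of the table; the sample event is hand-written and trimmed to the fields the pattern touches, so treat it as an assumption and prefer a real captured event.

```shell
# Pattern for the "Config resource non-compliant" row of the table.
PATTERN='{"source":["aws.config"],"detail-type":["Config Rules Compliance Change"],"detail":{"newEvaluationResult":{"complianceType":["NON_COMPLIANT"]}}}'

# Hand-written sample event (assumption): real events carry many more fields.
SAMPLE_EVENT='{"id":"1","account":"111122223333","source":"aws.config","time":"2024-01-01T00:00:00Z","region":"ap-south-1","resources":[],"detail-type":"Config Rules Compliance Change","detail":{"newEvaluationResult":{"complianceType":"NON_COMPLIANT"}}}'

# Asks EventBridge whether the pattern matches the sample; no rule is created.
check_pattern() {
  aws events test-event-pattern \
    --event-pattern "$PATTERN" --event "$SAMPLE_EVENT"
}
```

check_pattern returns a Result field indicating whether the sample matched, which makes it a cheap first step when a rule never fires.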

Targets. What: where matched events go — Lambda, SNS/SQS, Step Functions, Systems Manager Automation/Run Command, Kinesis/Firehose, ECS tasks, API destinations (any HTTP endpoint), another event bus, and more. Features: input transformer to reshape the event before delivery, dead-letter queues for failed deliveries, and automatic retries. When: the canonical auto-remediation loop — Config flags a non-compliant resource → EventBridge rule matches → Lambda or SSM Automation fixes it → SNS notifies the team.

Target What it does with the event Canonical use
Lambda Runs your code Custom remediation / enrichment
SNS / SQS Notify / queue for later Fan-out / buffered processing
Step Functions Start a state machine Multi-step orchestrated response
SSM Automation / Run Command Run a managed runbook Idempotent infra remediation
ECS task Launch a container task Batch / heavier processing
API destination POST to any HTTP endpoint PagerDuty/Slack/3rd-party
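The target features mentioned above (input transformer, retries, dead-letter queue) are all set on the target definition. Here is a sketch that attaches an SNS target to the ec2-stopped rule used in this lesson; the SQS dead-letter queue ARN is hypothetical, and its queue policy must allow EventBridge to send messages.

```shell
# Quoted heredoc so that $.detail paths and \" escapes reach the CLI untouched.
TARGETS=$(cat <<'EOF'
[{
  "Id": "notify-ops",
  "Arn": "arn:aws:sns:ap-south-1:111122223333:ops-alerts",
  "InputTransformer": {
    "InputPathsMap": {
      "instance": "$.detail.instance-id",
      "state": "$.detail.state"
    },
    "InputTemplate": "\"Instance <instance> is now <state>\""
  },
  "RetryPolicy": {"MaximumRetryAttempts": 4, "MaximumEventAgeInSeconds": 3600},
  "DeadLetterConfig": {"Arn": "arn:aws:sqs:ap-south-1:111122223333:ops-alerts-dlq"}
}]
EOF
)

attach_target() {
  aws events put-targets --rule ec2-stopped --targets "$TARGETS"
}
```

The input transformer replaces the full event JSON with a one-line message, which reads far better in an email or chat notification.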

EventBridge Pipes and Scheduler. What: Pipes is point-to-point source→(filter→enrich)→target plumbing (e.g. DynamoDB stream → Lambda enrichment → Step Functions) that replaces glue code; Scheduler is a dedicated, scalable scheduling service (cron, rate and one-time schedules, millions of them) that goes beyond scheduled rules. Gotcha: a pipe connects one source to one target, so for high-volume fan-out and SaaS integration reach for an EventBridge bus with rules; these are covered in depth in EventBridge Event-Driven Architecture: Buses, Schema & Pipes.
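A recurring schedule in EventBridge Scheduler might look like the sketch below. The schedule name, Lambda function ARN and execution role ARN are all hypothetical; the role must be assumable by Scheduler and allowed to invoke the function.

```shell
# Hypothetical target function and the role Scheduler assumes to invoke it.
SCHEDULE_TARGET=$(cat <<'EOF'
{
  "Arn": "arn:aws:lambda:ap-south-1:111122223333:function:nightly-report",
  "RoleArn": "arn:aws:iam::111122223333:role/scheduler-invoke-role"
}
EOF
)

# Every day at 02:00 in the given time zone, with no flexible window.
create_nightly_schedule() {
  aws scheduler create-schedule \
    --name nightly-report \
    --schedule-expression 'cron(0 2 * * ? *)' \
    --schedule-expression-timezone Asia/Kolkata \
    --flexible-time-window Mode=OFF \
    --target "$SCHEDULE_TARGET"
}
```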

# Rule: when any EC2 instance enters "stopped", notify an SNS topic
aws events put-rule --name ec2-stopped \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"]}}'

# The topic's access policy must allow events.amazonaws.com to publish,
# otherwise the rule matches but the notification is never delivered
aws events put-targets --rule ec2-stopped \
  --targets "Id"="1","Arn"="arn:aws:sns:ap-south-1:111122223333:ops-alerts"

AWS X-Ray, in brief — the “where did the time go”

AWS X-Ray is the distributed tracing service: it follows a single request as it travels through your application — API Gateway → Lambda → DynamoDB → an external HTTP call — and shows a service map and a timeline (trace) of where the latency and errors occurred. Where a metric says “p99 latency is 2 s” and a log says “this request failed”, X-Ray says “the 2 seconds was spent in this DynamoDB call on this code path”.
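Once an application is instrumented, slow requests can be pulled from the CLI as well as the console. This is a small sketch; the ten-minute window and the two-second threshold are arbitrary choices made to mirror the p99 example above.

```shell
# List traces from the last 10 minutes whose response time exceeded 2 seconds.
slow_traces() {
  local now
  now=$(date +%s)
  aws xray get-trace-summaries \
    --start-time $((now - 600)) --end-time "$now" \
    --filter-expression 'responsetime > 2' \
    --query 'TraceSummaries[].[Id,ResponseTime]' --output table
}
```

Each returned trace ID can then be opened in the console to see which segment consumed the time.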

How the three pillars divide the labour — and why you need all three:

Pillar Service Answers Strength Weakness
Metrics CloudWatch metrics That something is wrong Cheap, fast to alarm Aggregated, no detail
Logs CloudWatch Logs Why it is wrong Rich detail Costly at volume, slower
Traces X-Ray Where the time went Per-request, cross-service Sampled, needs instrumentation

Architecture at a glance

The diagram below ties the services into one loop you can read left to right. On the left, your workloads emit signal: EC2 (with the CloudWatch agent for the memory and disk metrics the hypervisor cannot see), Lambda and API Gateway, and every IAM principal whose API calls become audit records. Those signals fan into the middle: CloudWatch holds the what, meaning metrics (15-month retention, 1-second high resolution, p99 percentiles) and Logs Insights queries, with alarms wired as M-of-N and composite so on-call is paged only when signals agree. In parallel the audit plane captures the who and the drift: CloudTrail records every API call (management free, data events opt-in, multi-Region so global-service events from us-east-1 are not lost) and AWS Config records the resulting resource state and evaluates rules, both shipping to a tamper-proof log-archive bucket with SSE-KMS, Object Lock (WORM) and log-file validation.

From there the loop closes through detection and automation. A metric filter turns a log pattern (a root-account login) into a metric and an alarm; EventBridge matches any event — an alarm state-change, a Config NON_COMPLIANT finding, a CloudTrail DeleteBucket — against a JSON pattern and routes it to up to five targets: SNS to notify (wire the OK action too, not just ALARM), SSM Automation or Step Functions to remediate with an idempotent runbook, or a Lambda for custom fixes with a dead-letter queue on failure. The five numbered badges mark the silent failures that break this loop in production — no memory metric without the agent, an alarm that flaps or sits grey, a CloudTrail that misses the event you need, an archive that is deletable or a KMS key that blocks delivery, and an EventBridge rule that never matches or loops. Keep the picture in mind: CloudWatch is what, CloudTrail is who, Config is the state over time, and EventBridge is how you turn any of those into action.

Figure: AWS observability loop. EC2/Lambda/IAM workloads emit metrics and logs to CloudWatch and API calls to CloudTrail and AWS Config, feeding a detection plane of metric filters and EventBridge that routes to SNS, SSM/Step Functions and Lambda remediation, with five numbered failure points annotated.

Real-world scenario

Lumara Retail runs a mid-sized e-commerce platform on AWS across three accounts (prod, staging, security) in ap-south-1, with a small on-call rotation of four engineers. For a year their observability was “good enough” — EC2 default metrics, a handful of alarms, CloudTrail switched on in the console — until a Friday-evening incident exposed every gap at once.

It started as slow checkout. No latency alarm fired, because the only latency alarm they had used the Average statistic, which hid the slow tail. On-call eventually noticed from customer tweets, opened the dashboard, and found it empty: the dashboard had been built in us-east-1 months earlier, but the workload ran in ap-south-1, the classic wrong-Region blank graph. When they finally looked at the right Region, EC2 CPU was fine but the instances were thrashing; there were no memory metrics because the CloudWatch agent had never been installed, so a memory leak in the cart service was invisible. They restarted the fleet, which “fixed” it, and went to bed without a root cause.

Saturday the real damage surfaced. A junior engineer, debugging, had widened a security group to 0.0.0.0/0 on port 6379 to reach Redis directly — and nobody knew, because the team had no Config recorder and no alarm on security-group changes. The exposure sat open for eleven hours. They only found it when the GuardDuty finding fired, and then could not answer the auditor’s first two questions: who opened it (they had CloudTrail Event history, so eventually yes — AuthorizeSecurityGroupIngress by the junior’s role) and what the group looked like before (they had no Config timeline, so no).

The rebuild took a focused week and followed this article. They installed the CloudWatch agent via SSM across the fleet (memory and disk metrics now flow), and rebuilt alarms with p99 statistics, M-of-N evaluation (3 of 5) and deliberate missing-data treatment, grouped under composite alarms so a single flap no longer pages four people at 2am. They created a multi-Region organisation CloudTrail with log-file validation, delivering to an Object-Lock bucket in the security account, and routed it to CloudWatch Logs with metric filters alarming on root login, console-sign-in failures, and security-group changes (the CIS set). They turned on AWS Config in every account with restricted-ssh, s3-bucket-public-read-prohibited and encrypted-volumes, and wired an EventBridge rule from Config NON_COMPLIANT to two targets: an SSM Automation runbook that closes the open ingress, and an SNS notice to the channel. The next time someone widened a security group, Config flagged it in under two minutes, EventBridge fired, the runbook reverted it, and the team got a Slack message: the eleven-hour exposure became a ninety-second self-healing event. The lesson Lumara took away was exactly the triad: they had been treating “monitoring” as one thing, when what, who and what-changed are three different jobs needing three different services tied together by a fourth.

Advantages and disadvantages

The native AWS observability stack is the default for good reasons, but it has real rough edges. Weigh both before defaulting to a third-party platform:

Advantages Disadvantages
Zero-setup default metrics for most services No memory/disk without the agent (the gap that surprises everyone)
Tight IAM, KMS and Organizations integration Per-Region model means easy “empty dashboard” mistakes
CloudTrail + Config give audit & compliance out of the box Costs creep silently (never-expire logs, high-cardinality metrics, data events)
EventBridge closes the loop to automated remediation Logs Insights/dashboards are weaker UX than dedicated APM tools
No infrastructure to run; scales with the account Cross-account/cross-Region needs deliberate monitoring-account setup
Pay-per-use with a usable Free Tier Multi-cloud teams end up running a second tool anyway

When each side matters: for a single-cloud AWS shop that wants audit, compliance and remediation tied to the platform’s own IAM and Organizations, the native stack is hard to beat and cheap to start. For deep application performance management, rich dashboards, or a multi-cloud estate, teams often pair CloudWatch (for the AWS-native signals, CloudTrail and Config that only AWS can produce) with a third-party APM for the application layer — exporting metrics via metric streams and logs via subscription filters. The mistake is treating it as either/or: even teams on Datadog or Grafana keep CloudTrail, Config and EventBridge, because those are AWS-only capabilities.

Hands-on lab

You will publish a custom metric, create an alarm on it, send the alarm to SNS, create a log group and query it with Logs Insights, and confirm the always-on CloudTrail event history — then clean everything up. Run this in CloudShell (the aws CLI is pre-installed and already authenticated) or any configured terminal. Everything here is Free Tier-friendly: CloudWatch gives 10 custom metrics, 10 alarms, 5 GB of logs and 1 million API requests free per month; CloudTrail’s management-event history is free; the costs at this scale are effectively zero. We delete the chargeable bits at the end.

Step 1 — Set variables.

REGION=ap-south-1
TOPIC=obs-lab-alerts
export AWS_DEFAULT_REGION=$REGION

Step 2 — Create an SNS topic and subscribe your email.

TOPIC_ARN=$(aws sns create-topic --name $TOPIC --query TopicArn --output text)
aws sns subscribe --topic-arn $TOPIC_ARN --protocol email \
  --notification-endpoint you@example.com
# Check your inbox and click "Confirm subscription"
echo "$TOPIC_ARN"

Expected: an ARN like arn:aws:sns:ap-south-1:111122223333:obs-lab-alerts, and a confirmation email.

Step 3 — Publish a custom metric.

aws cloudwatch put-metric-data \
  --namespace "ObsLab" --metric-name QueueDepth \
  --unit Count --value 5

Validation: aws cloudwatch list-metrics --namespace ObsLab should list QueueDepth within a minute or two (custom metrics can take a moment to appear).

Step 4 — Create an alarm that pages on a deep queue.

aws cloudwatch put-metric-alarm \
  --alarm-name obs-lab-queue-deep \
  --namespace ObsLab --metric-name QueueDepth \
  --statistic Maximum --period 60 \
  --evaluation-periods 1 --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions "$TOPIC_ARN"

Step 5 — Drive the metric over the threshold and watch it alarm.

aws cloudwatch put-metric-data --namespace ObsLab --metric-name QueueDepth --value 50
# wait a minute, then:
aws cloudwatch describe-alarms --alarm-names obs-lab-queue-deep \
  --query 'MetricAlarms[0].StateValue' --output text

Expected: the state moves to ALARM and you receive an SNS email. Push a low value (--value 1) to see it return to OK.

Step 6 — Create a log group, log an event, and query with Logs Insights.

aws logs create-log-group --log-group-name /obs-lab/app
aws logs put-retention-policy --log-group-name /obs-lab/app --retention-in-days 1
STREAM=run-1
aws logs create-log-stream --log-group-name /obs-lab/app --log-stream-name $STREAM
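TS=$(($(date +%s)*1000))
# The event is passed as JSON: the CLI shorthand form cannot carry a message
# that itself contains braces and commas, so the inner quotes are escaped here
MSG='{\"level\":\"ERROR\",\"status\":500,\"path\":\"/checkout\"}'
aws logs put-log-events --log-group-name /obs-lab/app --log-stream-name $STREAM \
  --log-events "[{\"timestamp\":$TS,\"message\":\"$MSG\"}]"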

Now run a Logs Insights query (console: CloudWatch → Logs Insights → select /obs-lab/app), or from the CLI:

QID=$(aws logs start-query --log-group-name /obs-lab/app \
  --start-time $(($(date +%s)-3600)) --end-time $(date +%s) \
  --query-string 'fields @timestamp, status, path | filter status = 500 | sort @timestamp desc' \
  --query queryId --output text)
sleep 5
aws logs get-query-results --query-id "$QID"

Expected: the 500 event comes back with its status and path fields extracted from the JSON.

Step 7 — Confirm the CloudTrail event history (no trail needed).

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=PutMetricAlarm \
  --max-results 5 --query 'Events[].Username'

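Expected: your identity appears as the user who created the alarm in Step 4. This is the who did what, free and always on. Event history can lag the API call by several minutes, so retry if the list comes back empty.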

Cleanup.

aws cloudwatch delete-alarms --alarm-names obs-lab-queue-deep
aws logs delete-log-group --log-group-name /obs-lab/app
aws sns delete-topic --topic-arn "$TOPIC_ARN"
# custom metrics expire automatically (15 months) and cannot be deleted manually

Cost note. Within Free Tier this lab is effectively free. Outside it: custom metrics are billed per metric/month, alarms per alarm/month, logs per GB ingested and stored (set retention!), Logs Insights per GB scanned, and SNS email is free for the first 1,000 notifications. The single biggest real-world cost trap here is log groups left on “Never expire” — always set a retention policy.

Common mistakes & troubleshooting

Observability problems are mostly self-inflicted configuration gaps. Use this as a symptom → root cause → confirm → fix playbook; the Confirm column is the exact command or console path that proves it before you change anything.

# Symptom Likely root cause Confirm (command / path) Fix
1 Dashboard / alarm / metrics empty Wrong Region (CloudWatch is regional) Check console Region selector; echo $AWS_DEFAULT_REGION Switch Region; set widget Region per panel
2 No memory / disk metrics for EC2 Not default metrics; agent not installed aws cloudwatch list-metrics --namespace CWAgent is empty Install CloudWatch agent + CloudWatchAgentServerPolicy
3 Alarm stuck in INSUFFICIENT_DATA Metric stopped reporting; missing-data = missing describe-alarms shows the state; source resource stopped Set --treat-missing-data (breaching for heartbeats)
4 Alarm flaps on single spikes Evaluation periods = 1 describe-alarms --query 'MetricAlarms[].EvaluationPeriods' Use M-of-N (e.g. 3 of 5 datapoints)
5 Paged four times for one outage No composite alarm; every child pages Multiple correlated alarms all in ALARM Wrap children in a composite; suppress child actions
6 CloudTrail shows no S3 object reads Object access is a data event, off by default Trail config shows no data event selectors Enable S3 data events on the trail (mind cost)
7 Missing IAM / CloudFront / Route 53 events Global events log via us-east-1; trail single-Region describe-trails --query 'trailList[].IsMultiRegionTrail' is false Use a multi-Region trail
8 CloudWatch bill creeping up High-cardinality custom metrics; never-expire logs; data events Billing → Cost Explorer by usage type Cut dimensions; set log retention; scope data events
9 Config rule says NON_COMPLIANT, nothing fixes it Config evaluates, does not remediate Rule shows NON_COMPLIANT, no remediation attached Attach SSM Automation or wire EventBridge → Lambda
10 Config rule reports nothing for a resource Recorder off, or scope excludes the type describe-configuration-recorder-status recording=false Turn recorder on; allSupported + global resources
11 Metric filter never increments Filter created after the events; only new events count No data points on the metric since creation Re-test with a fresh matching log line
12 EventBridge rule never fires Event pattern does not match the real event shape aws events test-event-pattern --event-pattern ... --event ... Fix the JSON pattern against a real sample event
13 Logs Insights query is slow / costly Scanning too many groups / too wide a time range Query stats show GB scanned Narrow time range and log-group selection
14 Auto-remediation loops or fights a deploy Non-idempotent runbook; no exception path CloudTrail shows the fix firing repeatedly Make runbook idempotent; honour an exception tag
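The fixes for rows 4 and 5 (M-of-N evaluation and a composite alarm) can be sketched together. The load balancer dimension, alarm names and the second child alarm checkout-5xx-rate are hypothetical; the composite can only be created once both child alarms exist.

```shell
# Usage: create_checkout_alarms <sns-topic-arn>
create_checkout_alarms() {
  local topic_arn="$1"
  # Child alarm: p99 latency above 2 s in 3 of the last 5 one-minute periods.
  # It has no actions of its own, so it never pages directly.
  aws cloudwatch put-metric-alarm \
    --alarm-name checkout-p99-latency \
    --namespace AWS/ApplicationELB --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/checkout/0123456789abcdef \
    --extended-statistic p99 --period 60 \
    --evaluation-periods 5 --datapoints-to-alarm 3 \
    --threshold 2 --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
  # Composite alarm: page only when latency AND the (assumed) 5xx alarm agree.
  aws cloudwatch put-composite-alarm \
    --alarm-name checkout-degraded \
    --alarm-rule 'ALARM("checkout-p99-latency") AND ALARM("checkout-5xx-rate")' \
    --alarm-actions "$topic_arn" --ok-actions "$topic_arn"
}
```

Wiring --ok-actions as well as --alarm-actions means on-call is told when the incident clears, not only when it starts.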

Best practices

Security notes

Observability is a security control surface; lock it down accordingly.

Control What to do Why it matters
Protect the audit trail Deliver CloudTrail to a dedicated log-archive account bucket with Block Public Access, deletion-restricting bucket policy, Object Lock (WORM), and log-file validation Stops an attacker (or a mistake) erasing the evidence
Use an organisation trail Create it in the management account so members cannot disable it Guarantees every account is covered
Alarm on security events Route CloudTrail → CloudWatch Logs, add metric filters + alarms for root login, sign-in failures, IAM/SG/CloudTrail changes (the CIS set) Turns the audit log into real-time detection
Least privilege on dangerous actions Restrict and alarm on cloudwatch:PutMetricData, logs:DeleteLogGroup, cloudtrail:StopLogging These poison metrics, destroy evidence, or blind you
Encrypt logs and trails Associate a KMS CMK with log groups and the trail; control bucket read access Protects sensitive data at rest; gates who can read
Continuous posture Feed Config + Security Hub findings into EventBridge for automated response Closes the loop from detection to remediation
Real-time reaction path Consume CloudTrail via EventBridge, not S3 delivery, for anything time-sensitive S3 delivery is ~15 min — too slow for live response

For the deeper audit build, see CloudTrail & Config for Audit & Compliance.

Cost & sizing — the levers that move the bill

The observability bill is driven by volume, and a few levers control it. Know the unit you are billed in for each service and where the cost actually concentrates:

Service Billed per The cost trap The lever
CloudWatch metrics Custom metric / month + API requests Per-user / per-request dimensions explode count Aggregate dimensions; publish fewer
Detailed / high-res metrics Per-metric premium 1-min / 1-s everywhere Use only where sub-minute matters
CloudWatch Logs GB ingested + GB stored Never-expire retention Set retention; Infrequent-Access class
Logs Insights GB scanned per query Wide time ranges over all groups Narrow time + group selection
Dashboards Per dashboard / month (after 3) Many one-off dashboards Consolidate; define as code
CloudTrail Per event (mgmt first copy free) Data events at S3/Lambda scale Scope data events to key buckets
AWS Config Per CI recorded + per rule eval High-churn resources, broad recording Record selectively; tune rule triggers
X-Ray Per trace recorded + scanned Full-rate tracing Lower sampling; trace what matters

Rough INR/USD intuition at small scale: a single account with a dozen custom metrics, ten alarms, a few GB of logs at 14-day retention, a multi-Region management trail, Config on with a handful of managed rules, and modest EventBridge traffic typically lands in the low single-digit USD per month (a few hundred INR) — dominated by Config CIs and any data events you turn on. The pattern: turn on broad recording for security/audit (CloudTrail management events, Config) where it is cheap or required, and be deliberate about the high-volume items (data events, high-cardinality custom metrics, full-rate tracing).
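For the never-expire log retention trap in particular, a sweep like the following is one way to apply a default. It is a sketch that assumes 14 days is acceptable for every log group without a policy, so review the list before running it against an account that holds audit logs.

```shell
# Usage: set_default_retention [days]   (defaults to 14)
set_default_retention() {
  local days="${1:-14}"
  # Log groups with no retentionInDays attribute are the "Never expire" ones.
  aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' --output text |
    tr '\t' '\n' |
    while read -r group; do
      [ -n "$group" ] || continue
      echo "Setting ${days}-day retention on $group"
      aws logs put-retention-policy \
        --log-group-name "$group" --retention-in-days "$days"
    done
}
```

Running only the describe-log-groups part first gives a dry-run list of what would change.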

Interview & exam questions

  1. What is the difference between CloudTrail and CloudWatch? CloudTrail records API activity: who did what (audit/governance). CloudWatch records operational telemetry (metrics, logs, alarms, dashboards): what is happening with your resources (monitoring). Different jobs. (SAA-C03, SOA-C02)

  2. CloudTrail vs AWS Config — when do you use each? CloudTrail records the event (the API call that changed something). Config records the resulting configuration state and its history, and evaluates it against rules. “Who deleted the SG?” → CloudTrail. “What did the SG look like last week and is it compliant?” → Config. (SAA-C03, SCS-C02)

  3. Why don’t I see memory or disk-space metrics for my EC2 instance? They are not default metrics — the hypervisor cannot see inside the guest OS. Install the CloudWatch agent to collect them. Detailed monitoring only changes resolution (5 min → 1 min), it does not add memory/disk. (SOA-C02)

  4. What is a composite alarm and why use one? An alarm whose state is a boolean expression over other alarms, used to cut alarm noise — page only when multiple signals agree, or suppress dependent alarms. It cannot perform EC2/Auto Scaling actions, only notifications. (SOA-C02)

  5. Explain period, evaluation periods, and datapoints to alarm. Period = length of each data point; evaluation periods = how many recent periods to consider; datapoints to alarm = how many of those must breach. Together they give M-out-of-N (e.g. 3 of 5) to suppress single-spike flapping. (SOA-C02)

  6. CloudTrail management vs data vs Insights events? Management = control-plane operations (logged by default, one free trail copy). Data = high-volume data-plane operations like S3 GetObject / Lambda Invoke (off by default, charged). Insights = detected unusual activity in event volume (off by default, charged). (SCS-C02, SOA-C02)

  7. I enabled CloudTrail but can’t see who read an S3 object — why? Object reads/writes are data events, which are off by default. Enable S3 data events on the trail (they cost money at scale). (SCS-C02)

  8. How do you alarm on a pattern in your logs (e.g. “more than 5 errors a minute”)? Create a metric filter on the log group that increments a metric when the pattern matches, then put a CloudWatch alarm on that metric. (Metric filters only apply to new events.) (SOA-C02, DVA-C02)

  9. How do you query terabytes of logs ad-hoc without exporting them? CloudWatch Logs Insights — a query language (fields/filter/parse/stats/sort) billed per GB scanned, so narrow the time range and log groups. (SOA-C02, DVA-C02)

  10. What is the relationship between EventBridge and CloudWatch Events? They are the same service; EventBridge is the current name and superset (custom buses, SaaS partners, schema registry, Pipes, Scheduler). APIs are compatible. (DVA-C02, SAA-C03)

  11. How would you auto-remediate a non-compliant resource? AWS Config rule detects NON_COMPLIANT → attach an SSM Automation remediation, or route the Config event through EventBridge to a Lambda/SSM action, and notify via SNS. (SCS-C02, SOA-C02)

  12. You need a tamper-evident, multi-account audit log retained for years — what do you build? A multi-Region organisation CloudTrail with log-file validation, delivered to a dedicated log-archive account S3 bucket with Block Public Access and Object Lock; query with CloudTrail Lake or Athena. (SCS-C02)
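The M-out-of-N logic from question 5 is easier to reason about when you can run it. The following is a minimal sketch, not CloudWatch's actual evaluator: the function name `alarm_state` and the sample CPU values are made up for illustration, and it assumes a greater-than-threshold comparison, exactly one datapoint per period, and no missing data (real alarms also apply a treat-missing-data setting).

```python
from typing import Optional, Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: Optional[int] = None,
) -> str:
    """Return "ALARM" or "OK" for a greater-than-threshold alarm.

    Simplified model: one datapoint per period, no missing data.
    """
    # If datapoints-to-alarm is omitted, every evaluated period must breach (N of N).
    m = evaluation_periods if datapoints_to_alarm is None else datapoints_to_alarm
    if not 1 <= m <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")

    window = datapoints[-evaluation_periods:]  # the N most recent periods
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= m else "OK"


# Hypothetical CPU percentages, one value per period, threshold 80.
single_spike = [42, 95, 40, 41, 39]
sustained = [42, 95, 91, 40, 88]

print(alarm_state(single_spike, 80, evaluation_periods=5, datapoints_to_alarm=3))  # OK
print(alarm_state(sustained, 80, evaluation_periods=5, datapoints_to_alarm=3))     # ALARM
```

The single spike breaches in only one of five periods, so a 3-of-5 alarm stays OK; three breaching periods out of five, even when not consecutive, are enough to trigger it.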

Quick check

  1. Which service answers “who deleted this resource”?
  2. What does “datapoints to alarm = 3, evaluation periods = 5” mean?
  3. Are S3 object-level reads captured by CloudTrail by default?
  4. What is the default retention for a new CloudWatch log group?
  5. Which service records a timeline of a resource’s configuration and evaluates compliance rules?

Answers

  1. CloudTrail (the who did what; for the resulting state over time you’d use AWS Config).
  2. Alarm if 3 of the last 5 evaluation periods breach the threshold (the M-out-of-N pattern that suppresses single spikes).
  3. No — object-level access is a data event and is off by default; you must enable S3 data events on the trail.
  4. Never expire — which is why you should always set an explicit retention policy.
  5. AWS Config.
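
Answer 4 suggests an obvious housekeeping task: find every log group that still has no retention policy and set one. Here is a minimal sketch using boto3, under a few assumptions: the helper names, the sample log group names and the 30-day default are illustrative, and `apply_retention` needs credentials with permission to describe log groups and put retention policies. The filtering helper is pure Python, so you can try it without touching an account.

```python
from typing import Iterable, List, Mapping


def groups_without_retention(log_groups: Iterable[Mapping]) -> List[str]:
    """Names of log groups that have no retention policy (never expire)."""
    # DescribeLogGroups omits "retentionInDays" when events never expire.
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_retention(days: int = 30) -> None:
    """Set a retention policy on every never-expire log group (calls AWS)."""
    import boto3  # imported here so the pure helper above works without it

    logs = boto3.client("logs")
    for page in logs.get_paginator("describe_log_groups").paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)


# Hypothetical sample shaped like the "logGroups" list in a DescribeLogGroups response.
sample = [
    {"logGroupName": "/aws/lambda/checkout"},
    {"logGroupName": "/app/api", "retentionInDays": 30},
]
print(groups_without_retention(sample))  # ['/aws/lambda/checkout']
```

Calling `apply_retention()` changes real resources, so review the list returned by the helper first and pick a retention period that matches your compliance requirements.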
