AWS Observability, In Depth: CloudWatch, CloudTrail, Config & EventBridge

When something goes wrong in AWS at three in the morning, three questions decide how quickly you recover. What is broken right now? — a metric is in alarm, a queue is backing up, latency has tripled. Who changed something? — somebody, or some automation, touched a resource and the timeline matters. When did the configuration drift away from what it should be? — a security group opened, an S3 bucket lost its encryption, an IAM policy widened. AWS answers those three questions with three different services, and the single most useful mental model in all of AWS observability is to keep them straight: CloudWatch tells you what is happening, CloudTrail tells you who did what, and AWS Config tells you what the configuration is and how it changed over time. Tie them together with EventBridge, which turns any of those signals into automated action, and you have a complete observability and governance loop.

This lesson is deliberately exhaustive. Observability is one of the most heavily examined and most operationally important areas of AWS, and it is also where engineers most often have a fuzzy, incomplete picture — they know CloudWatch shows graphs and CloudTrail shows API calls, but cannot explain a composite alarm, a metric filter, a Config conformance pack, or why CloudWatch Events and EventBridge are the same thing under two names. We go through each service with the same treatment used across this course: what it is · the choices · the default · when to use it · the trade-off · the limits · the cost impact · the gotcha. Every core operation comes with a real aws CLI command so you can reproduce it by hand, and because this is a reference you will return to mid-incident, every concept, limit, option and failure mode is also laid out as a scannable table — read the prose once, then keep the tables open when the pager fires.

By the end you will be able to instrument a workload with metrics, alarms, logs and dashboards; query logs at scale with Logs Insights; record an audit trail of every API call with CloudTrail; track and enforce configuration with AWS Config; wire it all into automated remediation with EventBridge; and add distributed tracing with X-Ray. Enough to ace an SOA-C02 or SAA-C03 question, hold your own in an interview, and run a production account you can actually see into.

What problem this solves

Without an observability strategy you are flying blind. The workload runs, until it does not, and when it does not the only signals you have are a customer complaint and a blank stare at the console. The pain is concrete: you cannot tell whether the latency spike is the database or the app; you cannot prove who deleted the security group that took prod down; you cannot say what the IAM policy looked like before it was widened; and you have no automated way to catch a public S3 bucket the moment it is created instead of in next quarter’s audit. Every one of those is a different question, and reaching for the wrong service — grepping CloudWatch for “who deleted this”, or expecting CloudTrail to show you a resource’s state last Tuesday — burns the hour you do not have.

What breaks without it: incidents run long because nobody can localise the fault; security findings surface weeks late; compliance audits become archaeology; and cost quietly balloons because logs default to never expire and high-cardinality custom metrics multiply unwatched. Who hits this: every team running anything in AWS past a single toy instance, and hardest the teams running multi-account organisations where signal is scattered across Regions and accounts with no single pane of glass.

To frame the whole field before the deep dive, here is the question each service answers, the signal it produces, and the first place you look:

Question in the incident	Service that answers it	Signal it produces	First place to look	The classic mistake
What is broken right now?	CloudWatch	Metrics, alarms, logs, dashboards	Alarms list / dashboard for the Region	Looking in the wrong Region (empty graph)
Who did what, and when?	CloudTrail	Every API call: identity, action, IP, result	Event history (90-day, free)	Expecting data-plane reads (off by default)
What is the config, and how did it change?	AWS Config	Configuration-item timeline + compliance	Resource timeline / rule compliance	Confusing it with CloudTrail’s events
How do I react automatically?	EventBridge	Event match → target (Lambda/SSM/SNS)	Rule pattern + target wiring	Pointing AWS events at a custom bus
Where did the time go in one request?	X-Ray	Service map + per-request trace	Trace map for the slow operation	Expecting every request (it samples)

Learning objectives

By the end of this lesson you can:

Explain the who / what / when triad and map each question to CloudWatch, CloudTrail and AWS Config.
Describe CloudWatch metrics end to end — namespaces, dimensions, statistics, standard vs high resolution, custom metrics, and the unified CloudWatch agent.
Configure CloudWatch alarms — the three states, period/evaluation/datapoints-to-alarm, missing-data treatment, composite alarms, and alarm actions.
Work with CloudWatch Logs — log groups and streams, retention, metric filters, subscription filters, and querying with Logs Insights.
Build CloudWatch dashboards and explain widgets, cross-account/cross-Region views.
Set up CloudTrail correctly — management vs data vs Insights events, multi-Region organisation trails, log-file validation, and the always-on Event history.
Use AWS Config to record configuration history, evaluate rules, and deploy conformance packs for compliance.
Use EventBridge rules and buses to automate responses to events, and explain how it relates to “CloudWatch Events”.
Add basic distributed tracing with AWS X-Ray and know when to reach for it.

Prerequisites & where this fits

You should already be comfortable with the AWS basics — the Management Console and the aws CLI with a configured profile (covered in AWS Console, CLI, CloudShell & SDK First Steps), Regions and IAM roles/policies (see IAM Fundamentals: Users, Roles, Policies & Evaluation), and at least one service you can generate signal from (an EC2 instance, a Lambda function, or an S3 bucket). No prior monitoring experience is assumed; every term is defined. This is the Observability lesson of the AWS Zero-to-Hero course’s Foundation/Intermediate track, and it is the anchor the operational lessons build on: troubleshooting playbooks, frontend SLO monitoring, and structured logging pipelines all reference the metrics, alarms, logs and trails introduced here.

A quick map of where each piece sits and what depends on it, so you can see the shape before the detail:

Layer	Service(s)	Scope	Depends on	Built on top of it
Telemetry collection	CloudWatch metrics + agent	Per-Region	IAM role on the source	Alarms, dashboards, SLOs
Log storage & query	CloudWatch Logs + Insights	Per-Region	Log group + retention	Metric filters, subscriptions
Audit of API calls	CloudTrail	Multi-Region capable	S3 bucket, optional KMS	Athena/Lake, security alarms
Configuration state	AWS Config	Per-Region (global resource types recorded in one Region)	S3 + recorder	Rules, packs, aggregators
Event routing	EventBridge	Per-Region (default bus)	Source events	Remediation, fan-out
Distributed tracing	X-Ray	Per-Region, sampled	Instrumentation	Application Signals / SLOs

Core concepts

Before opening any console page, fix five mental models. They explain why these services are shaped the way they are.

Observability is signal plus the ability to ask new questions. Monitoring answers questions you decided to ask in advance (an alarm you pre-wired). Observability is being able to ask questions you did not anticipate — slicing logs by a field you did not pre-aggregate, correlating a latency spike with a deploy, tracing one slow request across five services. CloudWatch metrics and alarms are the monitoring half; Logs Insights, CloudTrail and X-Ray are what give you the observability half.

The three pillars: metrics, logs, traces. A metric is a number over time (CPU %, request count, queue depth) — cheap to store, fast to alarm on, but aggregated, so it tells you that something is wrong, not why. A log is a timestamped record of an event (a line of text or JSON) — rich and detailed, the why, but expensive at volume and slower to query. A trace follows a single request as it hops between services, showing where the time went. CloudWatch covers metrics and logs; X-Ray covers traces; all three live under the CloudWatch umbrella in the console today.

The who / what / when triad. Keep these three audit-and-observe questions separate because three different services answer them:

Question	Service	What it records	Retention default	Real-time?
What is happening / happened?	CloudWatch	Metrics, alarms, logs, dashboards — operational health	Metrics 15 mo; logs never-expire	Near real time
Who did what, and when?	CloudTrail	Every API call: identity, action, source IP, parameters, result	90-day history; trails as configured	~5–15 min to S3
What is the config and how did it change?	AWS Config	Resource configuration snapshots + a timeline of changes + compliance	7 years (configurable, 30-day minimum)	Minutes after change

An interviewer’s favourite trap is “I need to know who deleted this security group” (CloudTrail, not CloudWatch) versus “I need to know what my security group looked like last Tuesday and what changed” (AWS Config, not CloudTrail). CloudTrail tells you the event; Config tells you the state over time.

Everything is regional, with a few global exceptions. CloudWatch metrics, alarms, log groups, Config recorders and EventBridge buses are per-Region — a metric published in ap-south-1 does not appear in us-east-1. CloudTrail can be multi-Region (one trail captures all Regions) and some global services (IAM, CloudFront, Route 53) log only to us-east-1. This regionality is the single most common cause of “my alarm/dashboard is empty” — you are looking in the wrong Region.

Push vs pull, and the agent. AWS services push their own metrics to CloudWatch automatically (EC2 CPU, ELB request count, Lambda invocations) at no charge for the default set. But CloudWatch cannot see inside an EC2 instance — memory and disk usage are not default metrics because the hypervisor cannot see the guest OS. To get those, and to ship the instance’s log files, you install the CloudWatch agent inside the OS. This push model and the agent gap are exam-classic.

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup, but this is the mental model side by side:

Term	One-line definition	Where it lives	Why it matters
Metric	A time-ordered series of numeric data points	CloudWatch (per-Region)	The cheap “that something is wrong” signal
Namespace	Container grouping related metrics	Metric identity	`AWS/` is reserved; yours is anything
Dimension	Name/value pair scoping a metric to a resource	Metric identity	Each combo is a distinct billable metric
Alarm	A watcher on a metric with states + actions	CloudWatch	Turns a metric into a page or an action
Log group / stream	Container / per-source sequence of log events	CloudWatch Logs	Retention & filters live on the group
Metric filter	Pattern that turns matching logs into a metric	On a log group	Alarm on “N errors/min” without parsing
Trail	Config that delivers CloudTrail events to S3	CloudTrail	The durable, queryable audit record
Management / data event	Control-plane / data-plane API activity	CloudTrail	Data events are off by default and cost
Configuration item (CI)	Point-in-time snapshot of a resource	AWS Config	The unit you are billed per, and queried by
Config rule	Desired-state check → COMPLIANT/NON_COMPLIANT	AWS Config	Detects drift; does not fix it
Conformance pack	Bundle of Config rules + remediation	AWS Config	Deploy a whole standard at once
Event bus / rule	The pipe / the matcher routing events to targets	EventBridge	Turns any signal into automation
Segment / trace	One service’s work / one request’s full path	X-Ray	Shows which hop is slow

CloudWatch metrics, in depth

A metric is the fundamental CloudWatch concept: a time-ordered set of data points, each a number with a timestamp, identified by a namespace, a metric name, and zero or more dimensions.

Namespaces. What: a container that groups related metrics so names do not collide. AWS service metrics use the AWS/<service> convention (AWS/EC2, AWS/Lambda, AWS/ApplicationELB, AWS/RDS, AWS/SQS). Your custom metrics go in any namespace you choose (e.g. MyApp/Checkout). Gotcha: the AWS/ prefix is reserved — you cannot publish into it.

The service namespaces you will reach for most, the dimension that scopes them, and a signal worth alarming on in each:

Namespace	Service	Key dimension	A metric to watch	Why it matters
`AWS/EC2`	EC2 instances	`InstanceId`	`CPUUtilization`, `StatusCheckFailed`	Health + recover trigger
`AWS/Lambda`	Lambda	`FunctionName`	`Errors`, `Throttles`, `Duration`	Failures + concurrency limits
`AWS/ApplicationELB`	ALB	`LoadBalancer`	`HTTPCode_Target_5XX_Count`, `TargetResponseTime`	Backend errors + latency SLO
`AWS/RDS`	RDS / Aurora	`DBInstanceIdentifier`	`CPUUtilization`, `FreeableMemory`, `DatabaseConnections`	DB saturation
`AWS/SQS`	SQS	`QueueName`	`ApproximateAgeOfOldestMessage`	Backlog / stuck consumer
`AWS/DynamoDB`	DynamoDB	`TableName`	`ThrottledRequests`, `ConsumedReadCapacityUnits`	Capacity / hot partition
`AWS/ApiGateway`	API Gateway	`ApiName`	`5XXError`, `Latency`, `Count`	API health
`CWAgent`	CloudWatch agent	`InstanceId`, `path`	`mem_used_percent`, `disk_used_percent`	The memory/disk gap

Dimensions. What: name/value pairs that scope a metric to a specific resource — e.g. the AWS/EC2 CPUUtilization metric carries an InstanceId dimension so each instance has its own line. Choices: up to 30 dimensions per metric; each unique combination of namespace + name + dimensions is a distinct metric (this is what you are billed for as a custom metric). Gotcha: dimensions are part of the metric’s identity — if you publish CPUUtilization with InstanceId=i-abc and also without any dimension, those are two different metrics, and CloudWatch does not auto-aggregate across dimensions for custom metrics.
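To make the identity rule concrete, here is a minimal sketch that counts distinct metrics the way the paragraph above describes: namespace + name + the full set of dimensions. The sample data is hypothetical, and treating the dimension set as unordered is an assumption of the sketch.

```python
from typing import Dict, FrozenSet, Tuple

MetricKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> MetricKey:
    """Build a metric's identity: namespace + name + the full dimension set."""
    return (namespace, name, frozenset(dimensions.items()))


# Hypothetical data points an application might publish.
published = [
    ("MyApp/Checkout", "OrdersProcessed", {"Environment": "prod", "Service": "cart"}),
    # Same dimension set in a different order: modelled as the same metric.
    ("MyApp/Checkout", "OrdersProcessed", {"Service": "cart", "Environment": "prod"}),
    # Fewer dimensions: a different metric.
    ("MyApp/Checkout", "OrdersProcessed", {"Environment": "prod"}),
    # No dimensions at all: yet another metric.
    ("MyApp/Checkout", "OrdersProcessed", {}),
]

distinct_metrics = {metric_key(ns, name, dims) for ns, name, dims in published}
print(len(distinct_metrics))
```

Four data points produce three separate metrics here. Nothing rolls the dimensioned series up into the dimensionless one, so publish the aggregate yourself if you want it.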

Statistics. What: how data points in a period are aggregated for display/alarming. Choices: Average, Sum, Minimum, Maximum, SampleCount, and percentiles (p50, p90, p99, or any pNN.NN) and trimmed means. When: use Average for utilisation, Sum for counts (requests, errors), Maximum for “did it ever spike”, and percentiles for latency SLOs (a p99 latency alarm catches the slow tail an average hides). Gotcha: percentiles need raw samples — they do not work on metrics already pre-aggregated as statistic sets unless you publish the full distribution.

Pick the statistic that matches the question, not out of habit:

Statistic	What it answers	Best for	Trap if misused
`Average`	Typical value over the period	CPU/memory utilisation	Hides spikes that page you
`Sum`	Total over the period	Request count, errors, bytes	Meaningless for a gauge like CPU%
`Maximum`	Worst point in the period	“Did it ever breach?” heartbeats	One blip looks like sustained load
`Minimum`	Best point in the period	Free-capacity floors	Rarely what you alarm on
`SampleCount`	How many data points landed	Detecting a metric going silent	Not the value, just the count
`p90 / p99`	The slow tail latency	Latency SLOs, user experience	Needs raw samples, not stat sets
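A short sketch of why the table pairs latency with percentiles rather than Average. The latencies are made up, and the nearest-rank percentile used here is an illustrative choice; CloudWatch's own percentile calculation may interpolate differently.

```python
import math
from statistics import mean
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile over raw samples (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [100.0] * 98 + [5000.0] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(avg, p50, p99)
```

The average lands just under 200 ms and looks healthy, while p99 exposes the 5-second tail that two users in every hundred actually experienced.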

Resolution. What: how granular the data points are. Choices: standard resolution = 1-minute granularity (the default for AWS service metrics); high resolution = down to 1-second granularity for custom metrics. When: high resolution for fast-moving signals where a one-minute average smooths over a problem (e.g. a spiky request rate, autoscaling on sub-minute bursts). Cost/trade-off: high-resolution alarms can evaluate on 10-second or 30-second periods, but they are billed at a higher rate than standard alarms, and publishing every second means many more PutMetricData requests. Gotcha: any alarm with a period below 60 s is billed as a high-resolution alarm, so reserve sub-minute periods for signals that genuinely need them.

Resolution	Granularity	Who uses it	Alarm period floor	Cost note
Standard (default)	1 minute	AWS service metrics, most custom	60 s	Included for default AWS metrics
Detailed monitoring (EC2)	1 minute (vs 5)	EC2 you want finer	60 s	Per-instance charge; no new metrics
High resolution	1 second	Spiky custom metrics	10 s	Higher per-metric + per-alarm cost

Retention (automatic, free, and you cannot change it). CloudWatch keeps metric data at decreasing granularity and discards it after 15 months:

Original period	Retained as	For
< 60 s (high-res)	1-second data points	3 hours
60 s (1 min)	1-minute data points	15 days
5 min	5-minute data points	63 days
1 hour	1-hour data points	15 months

Gotcha: after 15 months metrics are gone — if you need longer retention for capacity planning or compliance, export to S3 (via metric streams) or store a copy yourself.
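The retention table can be read as a lookup: given how old a data point is and the period it was published at, what is the finest granularity you can still query? The sketch below encodes the four tiers from the table, using 455 days as the day count for 15 months; the function name is hypothetical.

```python
from datetime import timedelta
from typing import Optional

# (period in seconds, how long data at that granularity is kept), per the table above.
RETENTION_TIERS = [
    (1, timedelta(hours=3)),
    (60, timedelta(days=15)),
    (300, timedelta(days=63)),
    (3600, timedelta(days=455)),  # roughly 15 months
]


def finest_period_available(age: timedelta, published_period_s: int) -> Optional[int]:
    """Smallest period (seconds) still retrievable for a data point of this age.

    Returns None once the data point has aged out entirely.
    A metric can never be read at a finer period than it was published at.
    """
    for period_s, kept_for in RETENTION_TIERS:
        if period_s >= published_period_s and age <= kept_for:
            return period_s
    return None


print(finest_period_available(timedelta(days=30), 60))
```

A 1-minute metric that is 30 days old is only available as 5-minute points, which is why a graph of last month looks smoother than a graph of yesterday.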

Custom metrics. What: numbers your own application or scripts push to CloudWatch with PutMetricData. When: business and in-app signals CloudWatch cannot see — orders per minute, cache hit ratio, queue processing lag, memory usage. Cost: billed per custom metric per month (per unique namespace+name+dimensions combination), plus per-API-request charges; this adds up fast if you publish a metric per user or per request — use dimensions thoughtfully. Gotcha: PutMetricData accepts timestamps up to two weeks in the past and up to two hours in the future; outside that the point is rejected.

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/Checkout" \
  --metric-name OrdersProcessed \
  --unit Count --value 42 \
  --dimensions Environment=prod,Service=cart
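If you back-fill or replay metrics, it helps to check timestamps before publishing. This is a minimal sketch of the window described in the custom-metrics gotcha (two weeks back, two hours ahead); the function name is hypothetical and the check is local, not an AWS API call.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)
MAX_FUTURE = timedelta(hours=2)


def timestamp_accepted(ts: datetime, now: datetime) -> bool:
    """True if ts falls inside the accepted window around now."""
    return now - MAX_PAST <= ts <= now + MAX_FUTURE


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(timestamp_accepted(now - timedelta(days=13), now))  # inside the window
print(timestamp_accepted(now - timedelta(days=15), now))  # too old
```

Running a check like this in a replay job lets you drop or log stale points yourself instead of discovering the rejection later.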

Metric math and search expressions. What: compute new time series from existing ones, for example errors / requests * 100 for an error rate, SUM across instances, or anomaly-detection bands. Search expressions (SEARCH('{AWS/EC2,InstanceId} CPUUtilization', 'Average')) match metrics dynamically so a graph auto-includes new instances. When: fleet-wide dashboards and ratio alarms. Gotcha: an alarm watches exactly one time series and a search expression can return many, so you cannot create an alarm on a SEARCH expression; alarms on metric-math expressions that resolve to a single series are supported.
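The error-rate expression is easiest to see on a few numbers. The sketch below applies errors / requests * 100 period by period to hypothetical data. Returning None for a period with no requests is a choice made here for illustration, not a statement about how CloudWatch handles division by zero.

```python
from typing import List, Optional


def error_rate_percent(errors: List[float], requests: List[float]) -> List[Optional[float]]:
    """Per-period error rate in percent; None where there were no requests."""
    return [
        (e * 100 / r) if r else None
        for e, r in zip(errors, requests)
    ]


# Hypothetical per-minute Sum values for two metrics.
errors = [0, 5, 50, 0]
requests = [1000, 1000, 1000, 0]
print(error_rate_percent(errors, requests))
```

A ratio stays meaningful as traffic grows: 50 errors is 5% of 1,000 requests but noise against a million, which is why a rate alarm usually beats an alarm on the raw error count.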

The CloudWatch agent. What: a single binary (amazon-cloudwatch-agent) you install on EC2 or on-premises servers to collect OS-level metrics CloudWatch cannot see (memory, disk space, disk/network I/O, swap, per-process stats) and to ship log files to CloudWatch Logs. Config: a JSON config file (built interactively with amazon-cloudwatch-agent-config-wizard, often stored in SSM Parameter Store) defines which metrics and logs to collect; it can also collect StatsD and collectd custom metrics. Permissions: the instance needs an IAM role with CloudWatchAgentServerPolicy. Gotcha: the old “CloudWatch Logs agent” and per-instance “detailed monitoring” are different things — detailed monitoring just changes EC2 metrics from 5-minute to 1-minute resolution (for a charge); it does not add memory/disk metrics. Only the agent does that.
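For orientation, this sketch builds a small agent configuration as a Python dictionary and prints it as JSON. The key names follow the agent's configuration schema as far as I know it, and the log file path and log group name are hypothetical, so compare the result with what the wizard generates before relying on it.

```python
import json

# Hypothetical log file path and log group name; adjust to your own layout.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            # Memory and disk are the two signals the hypervisor cannot provide.
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",
                        "log_group_name": "/myapp/checkout",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Keeping the config as data in version control, then pushing it to SSM Parameter Store, means every instance in a fleet collects the same metrics and logs.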

What is and is not collected without the agent — the table that ends the “why no memory metric?” question forever:

Signal	Default (no agent)?	Source	How to get it	Gotcha
EC2 `CPUUtilization`	Yes	Hypervisor	Built-in `AWS/EC2`	5-min unless detailed monitoring
EC2 network in/out	Yes	Hypervisor	Built-in `AWS/EC2`	Bytes, not packets-per-app
EC2 disk read/write ops	Yes (EBS-level)	Hypervisor	Built-in `AWS/EC2`	Volume I/O, not free space
EC2 memory used %	No	Guest OS	CloudWatch agent	Hypervisor can’t see inside
EC2 disk free %	No	Guest OS	CloudWatch agent	The one that fills up at 3am
EC2 swap / per-process	No	Guest OS	CloudWatch agent	Needs in-OS collection
Application log files	No	Guest OS	CloudWatch agent	Or SDK / `awslogs` driver
Lambda invocations/errors	Yes	Service	Built-in `AWS/Lambda`	Per-function dimensions

CloudWatch alarms, in depth

An alarm watches a single metric (or a metric-math expression) and changes state when it breaches a threshold, optionally triggering actions.

The three states. What: OK (within threshold), ALARM (breaching), INSUFFICIENT_DATA (not enough data to decide — e.g. just created, or the metric stopped reporting). Gotcha: INSUFFICIENT_DATA is not failure; how you treat missing data (below) decides whether it becomes ALARM.

State	Meaning	Common cause	What to wire to it
`OK`	Within threshold	Healthy	OK action (the all-clear notification)
`ALARM`	Breaching the threshold	The actual problem	SNS page / Auto Scaling / EC2 action
`INSUFFICIENT_DATA`	Not enough data to decide	New alarm, or metric went silent	Treat-missing-data decides next state

Threshold and comparison. What: the value and operator (GreaterThanThreshold, LessThanOrEqualToThreshold, etc.), or an anomaly-detection band instead of a static number. When: static thresholds for known limits (CPU > 80%); anomaly detection for metrics whose “normal” varies by time of day.

Period, evaluation periods, and datapoints to alarm. What: the period is the length of each data point (e.g. 60 s); evaluation periods is how many recent periods are considered; datapoints to alarm is how many of those must breach. Example: period 60 s, evaluation periods 5, datapoints to alarm 3 → “alarm if 3 of the last 5 minutes breach” — the M-out-of-N pattern that suppresses single-spike flapping. Default: datapoints = evaluation periods (all must breach). Gotcha: setting evaluation periods to 1 makes the alarm twitchy; M-out-of-N is the production-grade choice.

These three knobs cause more bad pages than anything else; here is what each does and how to set it:

Parameter	What it controls	CLI flag	Typical value	If you get it wrong
Period	Length of one data point	`--period`	60 s	Too short = noisy; too long = slow
Evaluation periods (N)	How many recent periods to weigh	`--evaluation-periods`	5	1 = twitchy, flaps on a blip
Datapoints to alarm (M)	How many of N must breach	`--datapoints-to-alarm`	3	= N means a single good point clears it
Comparison operator	Direction of the breach	`--comparison-operator`	`GreaterThanThreshold`	Wrong direction = never fires
Threshold	The breach value	`--threshold`	workload-specific	Set from p99 baseline, not a guess

Missing data treatment. What: what to do when a period has no data. Choices: missing (default — treat as neither breaching nor OK), notBreaching (treat as OK), breaching (treat as ALARM), ignore (keep the current state). When: breaching for “this thing must always report” (a heartbeat); notBreaching to avoid false alarms on metrics that legitimately go quiet. Gotcha: a stopped EC2 instance stops publishing CPUUtilization, so an alarm on it may sit in INSUFFICIENT_DATA forever unless you set the treatment deliberately.

`--treat-missing-data`	Missing period is treated as	Use when	Example
`missing` (default)	Neither breach nor OK	You genuinely do not know	Default; often not what you want
`notBreaching`	OK	Metric legitimately goes quiet	Nightly-idle batch worker
`breaching`	ALARM	The thing must always report	Heartbeat / liveness metric
`ignore`	Keep current state	Avoid flip-flop on gaps	Sparse business metric
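The four treatments are easier to compare side by side. The model below is deliberately simplified: the real evaluator is more nuanced about partially missing windows, so read it as an illustration of the table rather than a reimplementation. Missing periods are represented as None, and the function name is hypothetical.

```python
from typing import List, Optional


def evaluate(window: List[Optional[float]], threshold: float, datapoints_to_alarm: int,
             treat_missing: str, current_state: str) -> str:
    """Simplified alarm evaluation over a window that may contain missing periods."""
    present = [v for v in window if v is not None]
    missing_count = len(window) - len(present)
    breaching = sum(1 for v in present if v > threshold)

    if treat_missing == "breaching":
        breaching += missing_count          # every gap counts against you
    elif treat_missing == "notBreaching":
        pass                                # every gap counts as healthy
    elif treat_missing == "ignore":
        if not present:
            return current_state            # nothing to judge: hold the state
    elif treat_missing == "missing":
        if not present:
            return "INSUFFICIENT_DATA"      # nothing to judge: say so
    else:
        raise ValueError(f"unknown treatment: {treat_missing}")

    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


silent = [None, None, None]  # e.g. a stopped instance that no longer reports
for treatment in ("missing", "notBreaching", "breaching", "ignore"):
    print(treatment, evaluate(silent, 80, 3, treatment, current_state="OK"))
```

One silent metric yields four different outcomes, so choose the treatment per alarm: breaching for a heartbeat, notBreaching for a worker that is allowed to go idle.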

Alarm actions. What: what happens on state change. Choices: publish to an SNS topic (email/SMS/Lambda/chat), trigger EC2 Auto Scaling policies, perform EC2 actions (stop/terminate/reboot/recover), create OpsItems/incidents in Systems Manager. When: SNS for notification and fan-out; Auto Scaling for elastic capacity; EC2 recover for automatic recovery of an impaired instance onto new hardware. Gotcha: you can set different actions for entering ALARM, OK, and INSUFFICIENT_DATA states — wire an OK action so you get the “all clear” too.

Action type	What it does	State it usually fires on	Limitation
SNS publish	Email/SMS/Lambda/chat fan-out	ALARM and OK	Cost negligible; the default choice
EC2 Auto Scaling policy	Add/remove instances	ALARM (scale-out)	Needs an ASG and a scaling policy
EC2 action (recover)	Move impaired instance to new HW	ALARM	Only certain instance/EBS configs
EC2 action (stop/terminate/reboot)	Lifecycle action	ALARM	Dangerous; scope IAM tightly
SSM OpsItem / incident	Open an operational ticket	ALARM	Needs Systems Manager set up

Composite alarms. What: an alarm whose state is a boolean expression over other alarms, such as ALARM("HighCPU") AND ALARM("HighLatency"), or (A OR B) AND NOT C. When: to reduce alarm noise by paging only when several signals agree (a real outage) rather than on every individual flap, and to model dependencies (“don’t page on the app alarm if the database alarm is already firing”). Capability: an actions suppressor, which is another alarm you designate, mutes the composite alarm’s own actions while that suppressor is in ALARM; child alarms stay quiet simply because you attach no actions to them. Gotcha: composite alarms cannot perform EC2 or Auto Scaling actions, only notifications/SNS, because they have no single underlying metric.

# Alarm: CPU > 80% for 3 of the last 5 minutes, notify an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name ec2-high-cpu \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average --period 60 \
  --evaluation-periods 5 --datapoints-to-alarm 3 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:ap-south-1:111122223333:ops-alerts

A metric alarm and a composite alarm are not interchangeable — know which you need:

Property	Metric alarm	Composite alarm
Watches	One metric / metric-math	A boolean expression over other alarms
Purpose	Detect one breach	Reduce noise; model dependencies
Can do EC2/ASG actions	Yes	No (notifications only)
Can suppress its own actions	n/a	Yes (actions suppressor)
Typical use	“CPU > 80% for 3 of 5”	“page only if CPU AND latency breach”
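A composite rule is plain boolean logic over child alarm states, which you can sanity-check locally before deploying it. In this sketch the alarm names HighCPU and HighLatency come from the example above, while DatabaseDown is a hypothetical third alarm added to show the dependency pattern.

```python
from typing import Dict


def in_alarm(states: Dict[str, str], name: str) -> bool:
    """Local equivalent of ALARM("name") in a composite alarm rule."""
    return states[name] == "ALARM"


def should_page(states: Dict[str, str]) -> bool:
    """(ALARM("HighCPU") AND ALARM("HighLatency")) AND NOT ALARM("DatabaseDown")"""
    return (
        in_alarm(states, "HighCPU")
        and in_alarm(states, "HighLatency")
        and not in_alarm(states, "DatabaseDown")
    )


cpu_only = {"HighCPU": "ALARM", "HighLatency": "OK", "DatabaseDown": "OK"}
real_outage = {"HighCPU": "ALARM", "HighLatency": "ALARM", "DatabaseDown": "OK"}
db_is_root_cause = {"HighCPU": "ALARM", "HighLatency": "ALARM", "DatabaseDown": "ALARM"}

print(should_page(cpu_only), should_page(real_outage), should_page(db_is_root_cause))
```

Only the middle case pages the application on-call: a lone CPU flap is ignored, and when the database alarm is already firing, that alarm carries the page instead.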

CloudWatch Logs, in depth

CloudWatch Logs is the managed store for log data — from Lambda, the CloudWatch agent, ECS/EKS, API Gateway, VPC Flow Logs, Route 53, and your own apps.

Log groups and log streams. What: a log group is the top-level container (one per application/component, e.g. /aws/lambda/checkout); a log stream is a sequence of log events from a single source within that group (one stream per Lambda execution environment, per EC2 instance, per container). Gotcha: retention, encryption, metric filters and subscription filters are set on the group, not the stream.

Concept	Granularity	Set on it	Example
Log group	One per app/component	Retention, KMS, filters	`/aws/lambda/checkout`
Log stream	One per source instance	Nothing configurable	one per Lambda env / EC2 host
Log event	One record	—	`{"level":"ERROR","status":500}`

Retention. What: how long events are kept before automatic deletion. Choices: 1 day up to 10 years, or Never expire (the default — and a classic cost trap). When: set a deliberate retention on every group; debug logs 7–30 days, audit logs longer. Gotcha: the default Never expire means logs accumulate and you pay storage forever — always set retention explicitly. New log groups created by some services still default to never-expire.

Log type	Suggested retention	Why	Cost lever
App debug / verbose	7–14 days	Useful only while fresh	Biggest storage saver
Access / request logs	30–90 days	Trend + incident lookback	Infrequent-Access log class
Security / audit	1–7 years (or longer)	Compliance, forensics	Export to S3 / Glacier
Default if you do nothing	Never expire	The trap	Always override it
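A rough arithmetic sketch of why Never expire is a trap. It assumes a constant daily ingest, ignores compression and pricing entirely, and uses made-up volumes; the function name is hypothetical.

```python
from typing import Optional


def stored_gb(daily_ingest_gb: float, days_elapsed: int, retention_days: Optional[int]) -> float:
    """Approximate log volume held after days_elapsed.

    retention_days=None models the Never expire default.
    """
    kept_days = days_elapsed if retention_days is None else min(days_elapsed, retention_days)
    return daily_ingest_gb * kept_days


# Hypothetical log group ingesting 5 GB per day, one year after creation.
print(stored_gb(5, 365, None))  # never expire: keeps growing
print(stored_gb(5, 365, 14))    # 14-day retention: plateaus
```

With a retention policy the stored volume plateaus at ingest times retention. Without one it grows linearly for as long as the log group exists.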

Encryption. What: log data is encrypted at rest by default; you can associate a KMS key for customer-managed encryption per log group.

Metric filters. What: a pattern that scans incoming log events and increments a CloudWatch metric when it matches — turning unstructured logs into a number you can alarm on. Example: count occurrences of ERROR or "statusCode": 500, or extract a numeric field (latency) from JSON and publish it. When: alarm on “more than N errors per minute” without parsing logs in real time yourself. Gotcha: metric filters only apply to new events after the filter is created — they do not back-fill existing logs; and the metric only emits data points when matches occur (mind missing-data treatment on the alarm). The classic exam example is the CIS-benchmark metric filters on CloudTrail logs that alarm on root-account usage or unauthorised API calls.
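To see what a metric filter does and why the missing-data caveat matters, here is a local simulation of a JSON pattern equivalent to `{ $.statusCode = 500 }` counting matches per minute. The log lines are hypothetical and the function is an illustration, not how CloudWatch implements filters.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def count_500s_per_minute(events: Iterable[Tuple[int, str]]) -> Counter:
    """events are (epoch_seconds, log_line). Returns {minute_start: match_count}."""
    counts: Counter = Counter()
    for ts, line in events:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines cannot match a JSON pattern
        if isinstance(record, dict) and record.get("statusCode") == 500:
            counts[ts - ts % 60] += 1
    return counts


events = [
    (5, '{"statusCode": 500, "path": "/checkout"}'),
    (20, '{"statusCode": 200, "path": "/cart"}'),
    (70, "plain text line, not JSON"),
    (130, '{"statusCode": 500, "path": "/checkout"}'),
    (150, '{"statusCode": 500, "path": "/pay"}'),
]
print(dict(count_500s_per_minute(events)))
```

The minute starting at 60 has no entry at all, not a zero. An alarm on this metric sees that minute as missing data, which is exactly where the treat-missing-data setting decides the outcome.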

Subscription filters. What: stream matching log events in near real time to a destination — Kinesis Data Streams, Firehose (→ S3/OpenSearch/Redshift), or Lambda. When: central log aggregation, real-time processing, or shipping to a SIEM/OpenSearch. Limit: historically one subscription filter per log group; account-level subscription filters and up to two filters per group are now supported. Gotcha: this is the standard path for the structured-logging pipeline pattern — see Structured Logging Pipeline on AWS: CloudWatch → Firehose → OpenSearch for the Firehose-to-OpenSearch build.

A metric filter and a subscription filter sound alike and do opposite jobs:

Filter type	Output	Destination	Use it to	Cost driver
Metric filter	A CloudWatch metric (a number)	CloudWatch metrics	Alarm on a log pattern	Per metric, near-free
Subscription filter	The matching log events themselves	Kinesis / Firehose / Lambda	Ship/aggregate logs elsewhere	Per GB delivered/processed

Logs Insights. What: an interactive, purpose-built query language to search and analyse log data across log groups without exporting it. Capabilities: fields, filter, parse (extract fields from text), stats (aggregate — count, avg, percentiles), sort, limit, and bin() for time-bucketing; auto-discovers fields in JSON logs. When: ad-hoc investigation — “show me the 20 slowest requests in the last hour”, “count 5xx by path”, “which user-agents hit this endpoint”. Cost: billed by the amount of data scanned per query, so narrow the time range and log groups. Gotcha: it queries, it does not alter; results can be saved and added to dashboards.

Logs Insights command	What it does	Example
`fields`	Select/derive fields to show	`fields @timestamp, status, duration`
`filter`	Keep matching rows	`filter status = 500`
`parse`	Extract fields from text	`parse @message "user=*;" as user`
`stats`	Aggregate (count/avg/pct)	`stats count(*) by bin(5m)`
`sort` / `limit`	Order and cap results	`sort duration desc`

# Logs Insights: the 10 slowest requests that returned HTTP 500, from a JSON log
fields @timestamp, @message, duration
| filter status = 500
| sort duration desc
| limit 10

# Count errors per 5-minute bucket
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
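If the pipeline syntax is new to you, it can help to read each stage as an ordinary list operation. The sketch below mirrors the two queries above over a handful of hypothetical, already-parsed records; it illustrates the semantics only and has nothing to do with how Logs Insights executes queries.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical parsed records: epoch seconds, HTTP status, duration in ms, raw message.
records = [
    {"ts": 1000, "status": 500, "duration": 120, "message": "ERROR upstream timeout"},
    {"ts": 1010, "status": 200, "duration": 900, "message": "ok"},
    {"ts": 1300, "status": 500, "duration": 450, "message": "ERROR db unavailable"},
    {"ts": 1310, "status": 500, "duration": 30, "message": "ERROR bad gateway"},
]


def slowest_failures(rows: List[dict], limit: int) -> List[dict]:
    """filter status = 500 | sort duration desc | limit N"""
    failed = [r for r in rows if r["status"] == 500]
    return sorted(failed, key=lambda r: r["duration"], reverse=True)[:limit]


def errors_per_bin(rows: List[dict], bin_seconds: int = 300) -> Dict[int, int]:
    """filter @message like /ERROR/ | stats count(*) by bin(5m)"""
    counts = Counter(
        r["ts"] - r["ts"] % bin_seconds for r in rows if "ERROR" in r["message"]
    )
    return dict(sorted(counts.items()))


print([r["duration"] for r in slowest_failures(records, 2)])
print(errors_per_bin(records))
```

The 900 ms request never appears in the first result because the filter stage removes it before the sort, which is the same reason the first query returns slow failures rather than slow requests in general.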

Live Tail and other features. What: Live Tail streams matching log events in real time in the console (great during a deploy); log class offers a cheaper Infrequent Access tier for logs you rarely query; export to S3 for long-term archival; Logs anomaly detection flags unusual patterns automatically.

Feature	What it gives you	When to reach for it
Live Tail	Real-time stream in the console	Watching a deploy or a live incident
Infrequent Access log class	Cheaper storage, limited features	Logs you rarely query but must keep
Export to S3	Bulk archival to cheap storage	Long-term retention / Athena
Logs anomaly detection	Auto-flags unusual log patterns	Catching novel errors you didn’t pre-filter
Data Protection	Masks sensitive data (PII) inline	Logs that may contain emails/cards
Embedded Metric Format (EMF)	Emit metrics from a structured log line	High-cardinality app metrics without `PutMetricData`

CloudWatch dashboards & alarms-at-a-glance

A dashboard is a customisable page of widgets (line/stacked-area/number/gauge/bar graphs, alarm-status widgets, logs-table widgets, text, and custom widgets backed by Lambda).

What: visualise metrics, alarm states and Logs Insights results on one screen for an on-call view.
Cross-account / cross-Region: dashboards (and alarms and Logs Insights) can pull from multiple accounts and Regions when you enable CloudWatch cross-account observability with a monitoring account — essential in multi-account organisations so on-call has one pane of glass.
Cost: the first 3 dashboards (up to 50 metrics each) are free; beyond that there is a small monthly charge per dashboard.
Gotcha: dashboards are global in the console list but each widget targets a specific Region; mixed-Region dashboards must set the Region per widget. Dashboards are not auto-created — define them as code (the dashboard body is JSON; manage via CloudFormation/CDK/Terraform) so they are version-controlled.

Widget type	Shows	Best for
Line / stacked-area	Metric trends over time	Latency, request rate, utilisation
Number / gauge	A single current value	SLO at-a-glance, error budget
Alarm status	State of one or many alarms	On-call “is anything red?” panel
Logs table (Insights)	Rows from a saved query	Recent errors inline on the board
Text / custom (Lambda)	Markdown / arbitrary render	Runbook links, bespoke visuals
Bar / pie	Categorical comparison	Errors by service, cost by tag
Explorer	Auto-grouped resource graphs by tag	Fleet view without hand-built widgets
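
Because the dashboard body is plain JSON, "dashboards as code" can start as a file in version control plus one CLI call. The sketch below is a minimal example under stated assumptions: the dashboard name and widget titles are made up, and ObsLab/QueueDepth is the custom metric published in the hands-on lab. It shows the per-widget Region setting that mixed-Region dashboards need.

```shell
# Dashboard-as-code sketch. The dashboard name and widget titles are
# illustrative; ObsLab/QueueDepth is the custom metric from the hands-on lab.
DASHBOARD_BODY=$(cat <<'EOF'
{
  "widgets": [
    {
      "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Queue depth (ap-south-1)",
        "region": "ap-south-1",
        "metrics": [ [ "ObsLab", "QueueDepth" ] ],
        "stat": "Maximum",
        "period": 60
      }
    },
    {
      "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Queue depth (us-east-1)",
        "region": "us-east-1",
        "metrics": [ [ "ObsLab", "QueueDepth" ] ],
        "stat": "Maximum",
        "period": 60
      }
    }
  ]
}
EOF
)

# Create or overwrite the dashboard from the JSON body above.
put_dashboard() {
  aws cloudwatch put-dashboard \
    --dashboard-name "${1:-obs-lab}" \
    --dashboard-body "$DASHBOARD_BODY"
}
```

Call `put_dashboard obs-lab` to apply it. The same JSON can later be moved into a CloudFormation, CDK or Terraform resource without changing its shape.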

CloudTrail, in depth — the “who did what”

CloudTrail records API activity in your account — who called which AWS API, when, from where, with what parameters, and whether it succeeded. It is your security and audit backbone, completely separate from CloudWatch’s operational metrics.

Event history (always on, free, 90 days). What: CloudTrail automatically keeps a 90-day, searchable history of management events in every Region with no setup and no charge. When: quick “who deleted this / who changed that” investigations. Limit: 90 days only, management events only, viewable/queryable but not delivered anywhere. Gotcha: for anything beyond 90 days, for data events, or for delivery to S3, you must create a trail.

Trails. What: a configuration that delivers events to an S3 bucket (and optionally CloudWatch Logs and EventBridge) for long-term retention and analysis. Choices: single-Region vs multi-Region (multi-Region is the recommended default — one trail captures all current and future Regions); organisation trail (created in the management account, captures every account in the AWS Organization, member accounts cannot disable it). Gotcha: global-service events (IAM, STS, CloudFront, Route 53) are logged via us-east-1 — if your trail is single-Region elsewhere you will miss them; multi-Region trails capture them correctly.

Trail choice	What it captures	When to use	Gotcha
Single-Region	One Region’s events	Rarely; isolated test	Misses global-service events
Multi-Region	All current + future Regions	The recommended default	Slightly more S3 volume
Organisation trail	Every account in the org	Multi-account governance	Members cannot disable it
With CloudWatch Logs	Events also to a log group	Metric-filter security alarms	Extra ingestion cost
With log-file validation	Hash-chained digest files	Compliance / forensics	Off by default; enable it

The three event categories:

Event type	What it captures	Default	Cost note
Management events	Control-plane operations — `RunInstances`, `CreateBucket`, `AttachRolePolicy`, console sign-in, `AssumeRole`	Logged by default (first copy of management events to a trail is free)	One free trail copy; additional trails charged per event
Data events	High-volume data-plane operations — S3 object `GetObject`/`PutObject`, Lambda `Invoke`, DynamoDB item ops	Off by default (must opt in, can be very high volume)	Charged per data event delivered
Insights events	Detected unusual activity in management or data event volume (e.g. a spike in `DeleteBucket` or errors)	Off by default (opt in)	Charged per Insights event analysed

Gotcha (the exam favourite): “I enabled CloudTrail but I cannot see who read this S3 object.” Reads are data events and are off by default — management events do not include object-level S3/Lambda activity. You must enable S3 data events on the trail (and they cost money at scale, so scope them to the buckets that matter).

Read/write filter. What: you can log only Read, only Write, or All events per category — narrowing to Write cuts noise and cost while keeping the changes that matter for audit.
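
The data-event opt-in and the read/write filter are both set through event selectors on the trail. A minimal sketch, assuming the trail is called org-audit-trail and that my-sensitive-bucket is a hypothetical bucket worth auditing at object level: management events are narrowed to writes, while object reads and writes are logged for that one bucket only.

```shell
# Two basic event selectors:
#   1. management events, writes only (cuts read noise)
#   2. S3 object-level data events, reads and writes, for one bucket
# my-sensitive-bucket is a hypothetical bucket name.
EVENT_SELECTORS=$(cat <<'EOF'
[
  { "ReadWriteType": "WriteOnly", "IncludeManagementEvents": true },
  {
    "ReadWriteType": "All",
    "IncludeManagementEvents": false,
    "DataResources": [
      { "Type": "AWS::S3::Object", "Values": [ "arn:aws:s3:::my-sensitive-bucket/" ] }
    ]
  }
]
EOF
)

# Replace the trail's event selectors with the two defined above.
apply_event_selectors() {
  aws cloudtrail put-event-selectors \
    --trail-name "$1" \
    --event-selectors "$EVENT_SELECTORS"
}
```

Run it as `apply_event_selectors org-audit-trail`. The call replaces whatever selectors the trail already has, so check the current ones with `aws cloudtrail get-event-selectors --trail-name org-audit-trail` first.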

Log-file integrity validation. What: CloudTrail can produce digest files (hash-chained, signed) so you can prove logs were not tampered with after delivery — aws cloudtrail validate-logs. When: compliance and forensics. Gotcha: you must enable it on the trail; it is not on by default.

Where the logs go and how you query them. Delivered as gzipped JSON to S3 (partition by account/Region/date). Query options: Athena (point-and-click table creation from the console), send to CloudWatch Logs for metric filters/alarms (e.g. alarm on root login), or use CloudTrail Lake — a managed, SQL-queryable event data store with its own retention (up to years) that removes the S3+Athena plumbing. Gotcha: delivery to S3 is near real time but not instant (typically within ~15 minutes) — CloudTrail is for audit, not low-latency alerting; for real-time reaction, route CloudTrail events through EventBridge.

Query path	What it is	Latency	Best for
Event history	Built-in 90-day console search	Seconds	Quick “who did this” lookups
Athena over S3	SQL on the delivered JSON	Minutes (after ~15 min delivery)	Ad-hoc forensics, joins
CloudWatch Logs + metric filter	Trail → log group → alarm	Near real time on the metric	Security alarms (root login)
CloudTrail Lake	Managed SQL event store	Minutes	Long retention, no S3 plumbing
EventBridge	Trail event → rule → target	Near real time	Automated reaction, not just audit

# Look up recent console sign-ins from the always-on Event history
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin \
  --max-results 10

# Create a multi-Region trail with log-file validation
# (the S3 bucket must already exist with a bucket policy that lets CloudTrail write to it)
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-cloudtrail-logs-111122223333 \
  --is-multi-region-trail --enable-log-file-validation
aws cloudtrail start-logging --name org-audit-trail
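
The "CloudWatch Logs + metric filter" path can be sketched for the root-login case. This assumes the trail already delivers to a CloudWatch Logs group (the group name below is a placeholder) and uses the filter pattern commonly recommended for root-account usage; the filter name, metric name and namespace are illustrative.

```shell
# Metric filter for root-account activity on a CloudTrail log group.
# The log group name, filter name and metric namespace are assumptions.
ROOT_PATTERN='{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }'

# Each matching CloudTrail record adds 1 to Security/RootAccountUsage.
create_root_usage_filter() {
  aws logs put-metric-filter \
    --log-group-name "${1:-cloudtrail/org-audit-trail}" \
    --filter-name root-account-usage \
    --filter-pattern "$ROOT_PATTERN" \
    --metric-transformations \
      metricName=RootAccountUsage,metricNamespace=Security,metricValue=1
}
```

After `create_root_usage_filter`, an ordinary alarm on Security/RootAccountUsage (Sum of 1 or more over a short period) completes the detection. The filter only counts events ingested after it exists, so test it with a fresh root action rather than waiting for old ones to show up.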

AWS Config, in depth — the “what is it, and how did it change”

AWS Config continuously records the configuration of your resources and keeps a timeline of every change, then evaluates that configuration against rules. Where CloudTrail records the event (“someone called AuthorizeSecurityGroupIngress”), Config records the resulting state (“this security group now allows 0.0.0.0/0 on port 22, and here is exactly what it looked like before and after, with the CloudTrail event that caused it”).

The configuration recorder & configuration items. What: the recorder captures, for each supported resource, a configuration item (CI) — a point-in-time snapshot of the resource’s attributes, relationships (this EC2 instance → this ENI → this security group), tags, and a link to the CloudTrail event that triggered the change. Choices: record all supported resource types (recommended) or a selected list; record global resources (IAM) in one Region to avoid duplication. Cost: charged per configuration item recorded and per rule evaluation, so high-churn resources cost more. Gotcha: Config must be turned on per Region and needs an S3 bucket for the configuration snapshots/history and (optionally) an SNS topic for change notifications.

Configuration history & snapshots. What: the full timeline lets you answer “what did this resource look like at 14:00 last Tuesday?” and “show me every change to this bucket policy this month”. Delivered to S3; queryable.

Config rules. What: desired-state checks that mark each resource COMPLIANT or NON_COMPLIANT. Choices: AWS managed rules (hundreds pre-built — s3-bucket-public-read-prohibited, encrypted-volumes, restricted-ssh, iam-password-policy) or custom rules backed by Lambda or Guard (policy-as-code). Trigger types: configuration-change-triggered (evaluate when a resource changes) or periodic (evaluate on a schedule). Gotcha: a rule reports compliance; it does not fix anything by itself.

Rule trigger	When it evaluates	Best for	Gotcha
Configuration change	The moment a resource changes	Catch drift immediately	Needs the recorder on for that type
Periodic	On a fixed schedule (e.g. 24h)	Account-wide posture checks	Up to a period of lag
AWS managed rule	Pre-built logic, parameterised	Most common checks	Know its parameters/limits
Custom (Lambda / Guard)	Your code / policy-as-code	Bespoke standards	You own the logic and its bugs

Remediation. What: attach an SSM Automation document to a rule to auto-remediate non-compliant resources (e.g. re-enable bucket encryption, remove an open ingress rule) — automatic or on-approval. Gotcha: test remediation in non-prod; an over-eager auto-remediation can fight a legitimate change.
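
Attaching a remediation to a rule might look like the sketch below. It assumes a rule named restricted-ssh is already deployed, uses the AWS-owned Automation document AWS-DisablePublicAccessForSecurityGroup as the runbook, and invents a role ARN (config-remediation) that you would replace with a role Automation is allowed to assume. Treat the parameter names as something to verify against the document's own schema before relying on them.

```shell
# Remediation sketch for the restricted-ssh rule.
# The role ARN is hypothetical; RESOURCE_ID is the placeholder Config replaces
# with the ID of the non-compliant resource (here, the security group).
REMEDIATION=$(cat <<'EOF'
[
  {
    "ConfigRuleName": "restricted-ssh",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-DisablePublicAccessForSecurityGroup",
    "Automatic": true,
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
      "AutomationAssumeRole": {
        "StaticValue": { "Values": [ "arn:aws:iam::111122223333:role/config-remediation" ] }
      },
      "GroupId": {
        "ResourceValue": { "Value": "RESOURCE_ID" }
      }
    }
  }
]
EOF
)

# Attach (or overwrite) the remediation configuration on the rule.
attach_remediation() {
  aws configservice put-remediation-configurations \
    --remediation-configurations "$REMEDIATION"
}
```

Setting "Automatic" to false keeps the same wiring but waits for someone to trigger the fix, which is the safer starting point while the runbook is still being tested in non-prod.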

Conformance packs. What: a collection of Config rules + remediation packaged as a single deployable unit (a YAML template), e.g. the Operational Best Practices for PCI DSS sample pack, or your own internal baseline. When: deploy a whole compliance standard at once, and across an entire AWS Organization with one action. Gotcha: a conformance pack creates its own resources and has its own cost (per rule-evaluation); deleting the pack removes its rules.

Aggregators. What: a multi-account, multi-Region view that rolls compliance and configuration data from many accounts into one dashboard — essential at organisation scale.

CloudTrail and Config are constantly confused; this is the line that separates them:

Dimension	CloudTrail	AWS Config
Records	The API call (the event)	The resulting state + its history
Answers	“Who called what, when, from where?”	“What did it look like, and is it compliant?”
Unit	An event record	A configuration item (CI)
Evaluates compliance	No	Yes (rules / packs)
Can remediate	No (route via EventBridge)	Yes (SSM Automation)
Billed per	Event (mgmt free first copy)	CI recorded + rule evaluation

# Deploy a managed rule (the configuration recorder and delivery channel must already be set up and recording)
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": { "Owner": "AWS", "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED" }
}'

# Check compliance
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited

EventBridge, in depth — turning signals into automation

Amazon EventBridge is the serverless event bus that connects AWS service events, your own application events, and SaaS events to targets — the glue that turns observability signals into automated action. It is the evolution of CloudWatch Events: the two are the same underlying service, the APIs are compatible, and the console moved CloudWatch Events under the EventBridge name. If a question mentions “CloudWatch Events”, read it as EventBridge.

Event buses. What: the pipe events flow through. Choices: the default bus (receives events from AWS services automatically), custom buses (for your own application events, isolating domains), and partner/SaaS buses (events from integrated SaaS providers). Gotcha: AWS service events land on the default bus only — you cannot point them at a custom bus directly.

Bus type	Receives	Use it for	Gotcha
Default bus	AWS service events automatically	Reacting to AWS events	AWS events land here only
Custom bus	Your own `PutEvents` events	Isolating app domains	You publish to it explicitly
Partner/SaaS bus	Integrated SaaS provider events	Zendesk/Datadog/etc. triggers	Requires the partner integration

Rules and event patterns. What: a rule matches events with an event pattern (JSON matching on fields: source, detail-type, and any nested detail field) and routes matches to up to 5 targets. Example pattern: match every EC2 instance that enters stopped, or every CloudTrail-delivered DeleteBucket, or every Config NON_COMPLIANT finding. Alternative: a scheduled rule (cron/rate expression) for time-based triggers, the serverless replacement for cron. Gotcha: content-based filtering happens before delivery, so targets are invoked (and billed) only for matching events; custom events are still charged when they are published to the bus, whether or not a rule matches them.

Common event patterns you will write, and what each catches — the detail shape is service-specific, so always validate against a real sample:

Goal	`source`	Match in `detail` / `detail-type`	Typical target
EC2 instance stopped	`aws.ec2`	`detail.state = stopped`	SNS / Lambda
Config resource non-compliant	`aws.config`	`detail.newEvaluationResult.complianceType = NON_COMPLIANT`	SSM Automation
CloudTrail `DeleteBucket`	`aws.s3` (via CloudTrail)	`detail.eventName = DeleteBucket`	SNS alert
GuardDuty finding	`aws.guardduty`	`detail.severity >= 7`	Lambda / SNS
Auto Scaling launch failed	`aws.autoscaling`	`detail-type = EC2 Instance Launch Unsuccessful`	SNS
Scheduled (cron)	—	`rate(5 minutes)` / `cron(...)`	Lambda batch
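
The table writes matches as shorthand such as `detail.severity >= 7`; in a real pattern that comparison is expressed with numeric matching. The sketch below pairs such a pattern with a trimmed, hand-written sample event (real GuardDuty findings carry many more fields) and checks one against the other before any rule is created.

```shell
# Pattern: GuardDuty findings with severity 7 or higher.
# "detail.severity >= 7" is expressed with EventBridge numeric matching.
PATTERN='{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"],"detail":{"severity":[{"numeric":[">=",7]}]}}'

# A trimmed, hand-written sample event (an assumption, not a captured finding).
SAMPLE_EVENT='{"id":"1","detail-type":"GuardDuty Finding","source":"aws.guardduty","account":"111122223333","time":"2024-01-01T00:00:00Z","region":"ap-south-1","resources":[],"detail":{"severity":8}}'

# Ask EventBridge whether the pattern would match the sample event.
test_pattern() {
  aws events test-event-pattern \
    --event-pattern "$PATTERN" \
    --event "$SAMPLE_EVENT"
}
```

`test_pattern` reports whether the pattern matches and creates nothing, so it is a cheap check to run in CI against sample events captured from the real bus.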

Targets. What: where matched events go — Lambda, SNS/SQS, Step Functions, Systems Manager Automation/Run Command, Kinesis/Firehose, ECS tasks, API destinations (any HTTP endpoint), another event bus, and more. Features: input transformer to reshape the event before delivery, dead-letter queues for failed deliveries, and automatic retries. When: the canonical auto-remediation loop — Config flags a non-compliant resource → EventBridge rule matches → Lambda or SSM Automation fixes it → SNS notifies the team.

Target	What it does with the event	Canonical use
Lambda	Runs your code	Custom remediation / enrichment
SNS / SQS	Notify / queue for later	Fan-out / buffered processing
Step Functions	Start a state machine	Multi-step orchestrated response
SSM Automation / Run Command	Run a managed runbook	Idempotent infra remediation
ECS task	Launch a container task	Batch / heavier processing
API destination	POST to any HTTP endpoint	PagerDuty/Slack/3rd-party
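
The input transformer, dead-letter queue and retry settings all live on the target definition. A minimal sketch, assuming the ec2-stopped rule and the ops-alerts topic used elsewhere in this lesson; the DLQ ARN and the target Id are hypothetical.

```shell
# One SNS target with a reshaped message, a dead-letter queue and a retry cap.
# The DLQ ARN and the target Id are hypothetical.
TARGETS=$(cat <<'EOF'
[
  {
    "Id": "notify-ops",
    "Arn": "arn:aws:sns:ap-south-1:111122223333:ops-alerts",
    "InputTransformer": {
      "InputPathsMap": {
        "instance": "$.detail.instance-id",
        "state": "$.detail.state"
      },
      "InputTemplate": "\"Instance <instance> entered state <state>\""
    },
    "DeadLetterConfig": {
      "Arn": "arn:aws:sqs:ap-south-1:111122223333:ops-alerts-dlq"
    },
    "RetryPolicy": {
      "MaximumRetryAttempts": 4,
      "MaximumEventAgeInSeconds": 3600
    }
  }
]
EOF
)

# Attach the target definition above to an existing rule.
put_rule_targets() {
  aws events put-targets --rule "$1" --targets "$TARGETS"
}
```

Use it as `put_rule_targets ec2-stopped`. The SNS topic and the SQS queue each need a resource policy that allows EventBridge to deliver to them, otherwise events fail without reaching either.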

EventBridge Pipes and Scheduler. What: Pipes is point-to-point source → (filter → enrich) → target plumbing (e.g. DynamoDB stream → Lambda enrichment → Step Functions) that replaces glue code; Scheduler is a dedicated scheduling service (millions of schedules, one-time or recurring) that goes beyond scheduled rules. Gotcha: a pipe connects one source to one target, so for high-volume fan-out and SaaS integration reach for an event bus with rules instead. All of this is covered in depth in EventBridge Event-Driven Architecture: Buses, Schema & Pipes.

# Rule: when any EC2 instance enters "stopped", notify an SNS topic
aws events put-rule --name ec2-stopped \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"]}}'

aws events put-targets --rule ec2-stopped \
  --targets "Id"="1","Arn"="arn:aws:sns:ap-south-1:111122223333:ops-alerts"

AWS X-Ray, in brief — the “where did the time go”

AWS X-Ray is the distributed tracing service: it follows a single request as it travels through your application — API Gateway → Lambda → DynamoDB → an external HTTP call — and shows a service map and a timeline (trace) of where the latency and errors occurred. Where a metric says “p99 latency is 2 s” and a log says “this request failed”, X-Ray says “the 2 seconds was spent in this DynamoDB call on this code path”.

Segments and subsegments: a segment is the work done by one service for a request; subsegments break that into downstream calls (a query, an SDK call). A trace is all segments for one request, stitched by a trace ID propagated in headers.
Instrumentation: enable on Lambda/API Gateway with a checkbox (Active tracing), or use the X-Ray SDK / OpenTelemetry (ADOT) in your code; the X-Ray daemon (or the CloudWatch agent / ADOT collector) buffers and ships segments.
Sampling: to control cost, X-Ray samples rather than tracing everything. The default rule records the first request each second plus 5% of any additional requests; both numbers are configurable via sampling rules.
When to use it: microservices and serverless where a request crosses several services and you need to find which hop is slow or erroring. For a single monolith, logs and metrics are usually enough.
Gotcha: tracing is per-Region and sampled — do not expect every request in the map; raise sampling only with cost in mind. X-Ray is now surfaced under CloudWatch in the console as part of unified observability (CloudWatch Application Signals builds SLOs on top of these traces). The deep build is in AWS X-Ray: Service Map, Segments & ADOT Tracing on EKS.
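
To see how little of the traffic a sampling rule keeps, it helps to do the arithmetic. The sketch below assumes the default rule (a reservoir of 1 request per second and a 5% rate on the rest) and treats the request rate as steady, which real traffic is not; the function name is made up for this example.

```shell
# Approximate traces recorded per second under a reservoir + fixed-rate rule.
# Assumed defaults: reservoir = 1 request/second, rate = 5% of the remainder.
sampled_per_second() {
  rps=$1
  reservoir=${2:-1}
  rate=${3:-0.05}
  awk -v rps="$rps" -v res="$reservoir" -v rate="$rate" 'BEGIN {
    # Below the reservoir every request is traced.
    if (rps <= res) { printf "%.2f\n", rps; exit }
    # Above it: the reservoir, plus a fraction of everything beyond it.
    printf "%.2f\n", res + (rps - res) * rate
  }'
}

sampled_per_second 200   # 1 + 199 * 0.05, prints 10.95
sampled_per_second 0.5   # under the reservoir, prints 0.50
```

At 200 requests per second roughly 11 traces per second are recorded, which is why a rare failing request may never appear in the service map. A dedicated sampling rule with a higher rate for one critical path is usually cheaper than raising the default for everything.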

How the three pillars divide the labour — and why you need all three:

Pillar	Service	Answers	Strength	Weakness
Metrics	CloudWatch metrics	That something is wrong	Cheap, fast to alarm	Aggregated, no detail
Logs	CloudWatch Logs	Why it is wrong	Rich detail	Costly at volume, slower
Traces	X-Ray	Where the time went	Per-request, cross-service	Sampled, needs instrumentation

Architecture at a glance

The diagram below ties the services into one loop you can read left to right. On the left, your workloads emit signal: EC2 (with the CloudWatch agent for the memory and disk metrics the hypervisor cannot see), Lambda and API Gateway, and every IAM principal whose API calls become audit records. Those signals fan into the middle: CloudWatch holds the what — metrics (15-month retention, 1-second high resolution, p99 percentiles) and Logs Insights queries, with alarms wired as M-of-N and composite so on-call is paged only when signals agree. In parallel the audit plane captures the who and the drift — CloudTrail records every API call (management free, data events opt-in, multi-Region so global-service events from us-east-1 are not lost) and AWS Config records the resulting resource state and evaluates rules, both shipping to a tamper-proof log-archive bucket with SSE-KMS, Object Lock (WORM) and log-file validation.

From there the loop closes through detection and automation. A metric filter turns a log pattern (a root-account login) into a metric and an alarm; EventBridge matches any event — an alarm state-change, a Config NON_COMPLIANT finding, a CloudTrail DeleteBucket — against a JSON pattern and routes it to up to five targets: SNS to notify (wire the OK action too, not just ALARM), SSM Automation or Step Functions to remediate with an idempotent runbook, or a Lambda for custom fixes with a dead-letter queue on failure. The five numbered badges mark the silent failures that break this loop in production — no memory metric without the agent, an alarm that flaps or sits grey, a CloudTrail that misses the event you need, an archive that is deletable or a KMS key that blocks delivery, and an EventBridge rule that never matches or loops. Keep the picture in mind: CloudWatch is what, CloudTrail is who, Config is the state over time, and EventBridge is how you turn any of those into action.

Figure: AWS observability loop. EC2/Lambda/IAM workloads emit metrics and logs to CloudWatch and API calls to CloudTrail and AWS Config, feeding a detection plane of metric filters and EventBridge that routes to SNS, SSM/Step Functions and Lambda remediation, with five numbered failure points annotated.

Real-world scenario

Lumara Retail runs a mid-sized e-commerce platform on AWS across three accounts (prod, staging, security) in ap-south-1, with a small on-call rotation of four engineers. For a year their observability was “good enough” — EC2 default metrics, a handful of alarms, CloudTrail switched on in the console — until a Friday-evening incident exposed every gap at once.

It started as slow checkout. No latency alarm fired, because the only latency alarm they had used the Average statistic, which hid the slow tail. On-call eventually noticed from customer tweets, opened the dashboard, and found it empty: its widgets had been built against us-east-1 months earlier, but the workload ran in ap-south-1, the classic wrong-Region blank graph. When they finally looked at the right Region, EC2 CPU was fine but the instances were thrashing; there were no memory metrics because the CloudWatch agent had never been installed, so a memory leak in the cart service was invisible. They restarted the fleet, which “fixed” it, and went to bed without a root cause.

Saturday the real damage surfaced. A junior engineer, debugging, had widened a security group to 0.0.0.0/0 on port 6379 to reach Redis directly — and nobody knew, because the team had no Config recorder and no alarm on security-group changes. The exposure sat open for eleven hours. They only found it when the GuardDuty finding fired, and then could not answer the auditor’s first two questions: who opened it (they had CloudTrail Event history, so eventually yes — AuthorizeSecurityGroupIngress by the junior’s role) and what the group looked like before (they had no Config timeline, so no).

The rebuild took a focused week and followed this article. They installed the CloudWatch agent via SSM across the fleet (memory and disk metrics now flow), and rebuilt alarms with p99 statistics, M-of-N evaluation (3 of 5) and deliberate missing-data treatment, grouped under composite alarms so a single flap no longer pages four people at 2am. They created a multi-Region organisation CloudTrail with log-file validation, delivering to an Object-Lock bucket in the security account, and routed it to CloudWatch Logs with metric filters alarming on root login, console-sign-in failures, and security-group changes (the CIS set). They turned on AWS Config in every account with restricted-ssh, s3-bucket-public-read-prohibited and encrypted-volumes, and wired an EventBridge rule from Config NON_COMPLIANT to two targets: an SSM Automation runbook that closes an open ingress, and an SNS topic that notifies the team channel. The next time someone widened a security group, Config flagged it in under two minutes, EventBridge fired, the runbook reverted it, and the team got a Slack message; the eleven-hour exposure became a ninety-second self-healing event. The lesson Lumara took away was exactly the triad: they had been treating “monitoring” as one thing, when what, who and what-changed are three different jobs needing three different services tied together by a fourth.

Advantages and disadvantages

The native AWS observability stack is the default for good reasons, and it has real edges. Weigh them before defaulting to a third-party platform:

Advantages	Disadvantages
Zero-setup default metrics for most services	No memory/disk without the agent (the gap that surprises everyone)
Tight IAM, KMS and Organizations integration	Per-Region model means easy “empty dashboard” mistakes
CloudTrail + Config give audit & compliance out of the box	Costs creep silently (never-expire logs, high-cardinality metrics, data events)
EventBridge closes the loop to automated remediation	Logs Insights/dashboards are weaker UX than dedicated APM tools
No infrastructure to run; scales with the account	Cross-account/cross-Region needs deliberate monitoring-account setup
Pay-per-use with a usable Free Tier	Multi-cloud teams end up running a second tool anyway

When each side matters: for a single-cloud AWS shop that wants audit, compliance and remediation tied to the platform’s own IAM and Organizations, the native stack is hard to beat and cheap to start. For deep application performance management, rich dashboards, or a multi-cloud estate, teams often pair CloudWatch (for the AWS-native signals, CloudTrail and Config that only AWS can produce) with a third-party APM for the application layer — exporting metrics via metric streams and logs via subscription filters. The mistake is treating it as either/or: even teams on Datadog or Grafana keep CloudTrail, Config and EventBridge, because those are AWS-only capabilities.

Hands-on lab

You will publish a custom metric, create an alarm on it, send the alarm to SNS, create a log group and query it with Logs Insights, and confirm the always-on CloudTrail event history — then clean everything up. Run this in CloudShell (the aws CLI is pre-installed and already authenticated) or any configured terminal. Everything here is Free Tier-friendly: CloudWatch gives 10 custom metrics, 10 alarms, 5 GB of logs and 1 million API requests free per month; CloudTrail’s management-event history is free; the costs at this scale are effectively zero. We delete the chargeable bits at the end.

Step 1 — Set variables.

REGION=ap-south-1
TOPIC=obs-lab-alerts
export AWS_DEFAULT_REGION=$REGION

Step 2 — Create an SNS topic and subscribe your email.

TOPIC_ARN=$(aws sns create-topic --name $TOPIC --query TopicArn --output text)
aws sns subscribe --topic-arn $TOPIC_ARN --protocol email \
  --notification-endpoint you@example.com
# Check your inbox and click "Confirm subscription"
echo "$TOPIC_ARN"

Expected: an ARN like arn:aws:sns:ap-south-1:111122223333:obs-lab-alerts, and a confirmation email.

Step 3 — Publish a custom metric.

aws cloudwatch put-metric-data \
  --namespace "ObsLab" --metric-name QueueDepth \
  --unit Count --value 5

Validation: aws cloudwatch list-metrics --namespace ObsLab should list QueueDepth within a minute or two (custom metrics can take a moment to appear).

Step 4 — Create an alarm that pages on a deep queue.

aws cloudwatch put-metric-alarm \
  --alarm-name obs-lab-queue-deep \
  --namespace ObsLab --metric-name QueueDepth \
  --statistic Maximum --period 60 \
  --evaluation-periods 1 --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions "$TOPIC_ARN"

Step 5 — Drive the metric over the threshold and watch it alarm.

aws cloudwatch put-metric-data --namespace ObsLab --metric-name QueueDepth --unit Count --value 50
# wait a minute, then:
aws cloudwatch describe-alarms --alarm-names obs-lab-queue-deep \
  --query 'MetricAlarms[0].StateValue' --output text

Expected: the state moves to ALARM and you receive an SNS email. Push a low value (--value 1) to see it return to OK.

Step 6 — Create a log group, log an event, and query with Logs Insights.

aws logs create-log-group --log-group-name /obs-lab/app
aws logs put-retention-policy --log-group-name /obs-lab/app --retention-in-days 1
STREAM=run-1
aws logs create-log-stream --log-group-name /obs-lab/app --log-stream-name $STREAM
TS=$(($(date +%s)*1000))
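# JSON form is used for --log-events because the message is itself JSON,
# which the CLI shorthand syntax cannot parse reliably
aws logs put-log-events --log-group-name /obs-lab/app --log-stream-name "$STREAM" \
  --log-events '[{"timestamp":'"$TS"',"message":"{\"level\":\"ERROR\",\"status\":500,\"path\":\"/checkout\"}"}]'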

Now run a Logs Insights query (console: CloudWatch → Logs Insights → select /obs-lab/app), or from the CLI:

QID=$(aws logs start-query --log-group-name /obs-lab/app \
  --start-time $(($(date +%s)-3600)) --end-time $(date +%s) \
  --query-string 'fields @timestamp, status, path | filter status = 500 | sort @timestamp desc' \
  --query queryId --output text)
sleep 5
aws logs get-query-results --query-id "$QID"

Expected: the 500 event comes back with its status and path fields extracted from the JSON.
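
A fixed `sleep 5` is usually enough for a one-line log group, but larger queries take longer. A small polling helper is one alternative; the sketch below (the function name and retry count are this example's own choices) re-checks the query status once a second until it leaves the Scheduled or Running states.

```shell
# Poll a Logs Insights query until it leaves the Scheduled/Running states.
# Prints the final status and succeeds only if that status is Complete.
wait_for_query() {
  qid=$1
  tries=${2:-30}          # give up after this many one-second polls
  status=Unknown
  while [ "$tries" -gt 0 ]; do
    status=$(aws logs get-query-results --query-id "$qid" \
      --query status --output text)
    case $status in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
    tries=$((tries - 1))
  done
  echo "$status"
  [ "$status" = "Complete" ]
}
```

With it, the last two lines of the step become `wait_for_query "$QID" && aws logs get-query-results --query-id "$QID"`. If the query completes with no rows, the log event may simply not be indexed yet; wait a few seconds and start a new query.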

Step 7 — Confirm the CloudTrail event history (no trail needed).

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=PutMetricAlarm \
  --max-results 5 --query 'Events[].Username'

Expected: your identity appears as the user who created the alarm in Step 4: the who did what, free and always on. Event history can lag by a few minutes, so retry if the list comes back empty.

Cleanup.

aws cloudwatch delete-alarms --alarm-names obs-lab-queue-deep
aws logs delete-log-group --log-group-name /obs-lab/app
aws sns delete-topic --topic-arn "$TOPIC_ARN"
# custom metrics expire automatically (15 months) and cannot be deleted manually

Cost note. Within Free Tier this lab is effectively free. Outside it: custom metrics are billed per metric/month, alarms per alarm/month, logs per GB ingested and stored (set retention!), Logs Insights per GB scanned, and SNS email is free for the first 1,000 notifications. The single biggest real-world cost trap here is log groups left on “Never expire” — always set a retention policy.

Common mistakes & troubleshooting

Observability problems are mostly self-inflicted configuration gaps. Use this as a symptom → root cause → confirm → fix playbook; the Confirm column is the exact command or console path that proves it before you change anything.

#	Symptom	Likely root cause	Confirm (command / path)	Fix
1	Dashboard / alarm / metrics empty	Wrong Region (CloudWatch is regional)	Check console Region selector; `echo $AWS_DEFAULT_REGION`	Switch Region; set widget Region per panel
2	No memory / disk metrics for EC2	Not default metrics; agent not installed	`aws cloudwatch list-metrics --namespace CWAgent` is empty	Install CloudWatch agent + `CloudWatchAgentServerPolicy`
3	Alarm stuck in `INSUFFICIENT_DATA`	Metric stopped reporting; missing-data = `missing`	`describe-alarms` shows the state; source resource stopped	Set `--treat-missing-data` (`breaching` for heartbeats)
4	Alarm flaps on single spikes	Evaluation periods = 1	`describe-alarms --query 'MetricAlarms[].EvaluationPeriods'`	Use M-of-N (e.g. 3 of 5 datapoints)
5	Paged four times for one outage	No composite alarm; every child pages	Multiple correlated alarms all in ALARM	Wrap children in a composite; suppress child actions
6	CloudTrail shows no S3 object reads	Object access is a data event, off by default	Trail config shows no data event selectors	Enable S3 data events on the trail (mind cost)
7	Missing IAM / CloudFront / Route 53 events	Global events log via `us-east-1`; trail single-Region	`describe-trails --query 'trailList[].IsMultiRegionTrail'` is false	Use a multi-Region trail
8	CloudWatch bill creeping up	High-cardinality custom metrics; never-expire logs; data events	Billing → Cost Explorer by usage type	Cut dimensions; set log retention; scope data events
9	Config rule says NON_COMPLIANT, nothing fixes it	Config evaluates, does not remediate	Rule shows NON_COMPLIANT, no remediation attached	Attach SSM Automation or wire EventBridge → Lambda
10	Config rule reports nothing for a resource	Recorder off, or scope excludes the type	`describe-configuration-recorder-status` recording=false	Turn recorder on; `allSupported` + global resources
11	Metric filter never increments	Filter created after the events; only new events count	No data points on the metric since creation	Re-test with a fresh matching log line
12	EventBridge rule never fires	Event pattern does not match the real event shape	`aws events test-event-pattern --event-pattern ... --event ...`	Fix the JSON pattern against a real sample event
13	Logs Insights query is slow / costly	Scanning too many groups / too wide a time range	Query stats show GB scanned	Narrow time range and log-group selection
14	Auto-remediation loops or fights a deploy	Non-idempotent runbook; no exception path	CloudTrail shows the fix firing repeatedly	Make runbook idempotent; honour an exception tag

Best practices

Use the triad on purpose. CloudWatch for health, CloudTrail for audit, Config for state/compliance — don’t try to make one do another’s job.
Set log retention on every log group. Never-expire is the default and the number-one observability cost trap.
Alarm with M-out-of-N and sensible missing-data treatment. Reduce paging noise; use composite alarms so on-call is paged only when signals agree.
Wire OK actions, not just ALARM actions. You want the all-clear too.
Install the CloudWatch agent on every EC2 instance so memory and disk-space metrics exist before you need them at 3am.
Enable a multi-Region, organisation CloudTrail with log-file validation delivered to a locked-down, separate-account S3 bucket — your tamper-evident audit record.
Turn on AWS Config across all accounts/Regions with an aggregator and a conformance pack for your baseline; add auto-remediation for the high-value rules.
Automate remediation through EventBridge — Config/CloudTrail/CloudWatch event → EventBridge rule → Lambda/SSM → SNS.
Manage dashboards, alarms, trails, rules and conformance packs as code (CloudFormation/CDK/Terraform) so they are reviewed and reproducible — don’t click-ops production observability.
Publish business metrics and structured (JSON) logs so Logs Insights and metric filters can do real work; extend this to frontend SLOs with CloudWatch RUM, Synthetics & Canaries for Frontend SLO Monitoring.

Security notes

Observability is a security control surface; lock it down accordingly.

Control	What to do	Why it matters
Protect the audit trail	Deliver CloudTrail to a dedicated log-archive account bucket with Block Public Access, deletion-restricting bucket policy, Object Lock (WORM), and log-file validation	Stops an attacker (or a mistake) erasing the evidence
Use an organisation trail	Create it in the management account so members cannot disable it	Guarantees every account is covered
Alarm on security events	Route CloudTrail → CloudWatch Logs, add metric filters + alarms for root login, sign-in failures, IAM/SG/CloudTrail changes (the CIS set)	Turns the audit log into real-time detection
Least privilege on dangerous actions	Restrict and alarm on `cloudwatch:PutMetricData`, `logs:DeleteLogGroup`, `cloudtrail:StopLogging`	These poison metrics, destroy evidence, or blind you
Encrypt logs and trails	Associate a KMS CMK with log groups and the trail; control bucket read access	Protects sensitive data at rest; gates who can read
Continuous posture	Feed Config + Security Hub findings into EventBridge for automated response	Closes the loop from detection to remediation
Real-time reaction path	Consume CloudTrail via EventBridge, not S3 delivery, for anything time-sensitive	S3 delivery is batched and lags by minutes (about 5 on average, not guaranteed, and often quoted as up to ~15), which is too slow for live response
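
To make the "Alarm on security events" row concrete, here is a hedged Python sketch of what a root-usage metric filter does. It approximates the widely used CIS-style pattern `{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }` as a plain predicate. The sample events are hand-written and trimmed for illustration, not real CloudTrail records, and `is_root_activity` is a hypothetical name.

```python
def is_root_activity(event: dict) -> bool:
    """Approximate the CIS-style root-usage filter pattern on one CloudTrail record."""
    identity = event.get("userIdentity", {})
    return (
        identity.get("type") == "Root"
        and "invokedBy" not in identity          # not an AWS service acting on your behalf
        and event.get("eventType") != "AwsServiceEvent"
    )


# Hand-written sample records (illustrative only).
sample_events = [
    {"eventName": "ConsoleLogin", "eventType": "AwsConsoleSignIn",
     "userIdentity": {"type": "Root"}},
    {"eventName": "RunInstances", "eventType": "AwsApiCall",
     "userIdentity": {"type": "IAMUser", "userName": "alice"}},
    {"eventName": "Decrypt", "eventType": "AwsServiceEvent",
     "userIdentity": {"type": "Root", "invokedBy": "kms.amazonaws.com"}},
]

# A metric filter increments a metric per match; an alarm then watches that metric.
root_usage_count = sum(1 for e in sample_events if is_root_activity(e))
print(root_usage_count)  # 1
```

In the real setup, the count becomes a CloudWatch metric and an alarm with a threshold of 1 turns any root usage into a page.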

For the deeper audit build, see CloudTrail & Config for Audit & Compliance.

Cost & sizing — the levers that move the bill

The observability bill is driven by volume, and a few levers control it. Know the unit you are billed in for each service and where the cost actually concentrates:

Service	Billed per	The cost trap	The lever
CloudWatch metrics	Custom metric / month + API requests	Per-user / per-request dimensions explode count	Aggregate dimensions; publish fewer
Detailed / high-res metrics	Per-metric premium	1-min / 1-s everywhere	Use only where sub-minute matters
CloudWatch Logs	GB ingested + GB stored	Never-expire retention	Set retention; Infrequent-Access class
Logs Insights	GB scanned per query	Wide time ranges over all groups	Narrow time + group selection
Dashboards	Per dashboard / month (after 3)	Many one-off dashboards	Consolidate; define as code
CloudTrail	Per event (mgmt first copy free)	Data events at S3/Lambda scale	Scope data events to key buckets
AWS Config	Per CI recorded + per rule eval	High-churn resources, broad recording	Record selectively; tune rule triggers
X-Ray	Per trace recorded + scanned	Full-rate tracing	Lower sampling; trace what matters
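
The first row's trap is simple arithmetic, because each unique combination of metric name and dimension values counts as its own custom metric. A minimal Python sketch, with made-up metric names and dimension sizes and no prices assumed:

```python
def metric_count(metric_names, dimension_values):
    """Distinct custom metrics if every dimension combination is published."""
    combos = 1
    for values in dimension_values.values():
        combos *= len(values)
    return len(metric_names) * combos


names = ["Latency", "Errors", "Orders"]  # hypothetical metric names

aggregated = metric_count(names, {
    "Service": ["cart", "checkout", "search", "auth"],
    "Region": ["ap-south-1", "us-east-1"],
})
per_user = metric_count(names, {
    "Service": ["cart", "checkout", "search", "auth"],
    "Region": ["ap-south-1", "us-east-1"],
    "UserId": range(10_000),  # one high-cardinality dimension
})

print(aggregated, per_user)  # 24 240000
```

Adding one per-user dimension multiplies the billed metric count by the number of users, which is why that kind of detail belongs in logs rather than in dimensions.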

Rough INR/USD intuition at small scale: a single account with a dozen custom metrics, ten alarms, a few GB of logs at 14-day retention, a multi-Region management trail, Config on with a handful of managed rules, and modest EventBridge traffic typically lands in the low single-digit USD per month (a few hundred INR) — dominated by Config CIs and any data events you turn on. The pattern: turn on broad recording for security/audit (CloudTrail management events, Config) where it is cheap or required, and be deliberate about the high-volume items (data events, high-cardinality custom metrics, full-rate tracing).

Interview & exam questions

What is the difference between CloudTrail and CloudWatch? CloudTrail records API activity — who did what (audit/governance). CloudWatch records operational telemetry — metrics, logs, alarms, dashboards — what is happening with your resources (monitoring). Different jobs. (SAA-C03, SOA-C02)
CloudTrail vs AWS Config — when do you use each? CloudTrail records the event (the API call that changed something). Config records the resulting configuration state and its history, and evaluates it against rules. “Who deleted the SG?” → CloudTrail. “What did the SG look like last week and is it compliant?” → Config. (SAA-C03, SCS-C02)
Why don’t I see memory or disk-space metrics for my EC2 instance? They are not default metrics — the hypervisor cannot see inside the guest OS. Install the CloudWatch agent to collect them. Detailed monitoring only changes resolution (5 min → 1 min), it does not add memory/disk. (SOA-C02)
What is a composite alarm and why use one? An alarm whose state is a boolean expression over the states of other alarms, used to cut alarm noise: page only when multiple signals agree, or suppress dependent alarms. It cannot perform EC2 or Auto Scaling actions; it can send SNS notifications and create Systems Manager OpsItems or Incident Manager incidents. (SOA-C02)
Explain period, evaluation periods, and datapoints to alarm. Period = length of each data point; evaluation periods = how many recent periods to consider; datapoints to alarm = how many of those must breach. Together they give M-out-of-N (e.g. 3 of 5) to suppress single-spike flapping. (SOA-C02)
CloudTrail management vs data vs Insights events? Management = control-plane operations (logged by default, one free trail copy). Data = high-volume data-plane operations like S3 GetObject / Lambda Invoke (off by default, charged). Insights = detected unusual activity in event volume (off by default, charged). (SCS-C02, SOA-C02)
I enabled CloudTrail but can’t see who read an S3 object — why? Object reads/writes are data events, which are off by default. Enable S3 data events on the trail (they cost money at scale). (SCS-C02)
How do you alarm on a pattern in your logs (e.g. “more than 5 errors a minute”)? Create a metric filter on the log group that increments a metric when the pattern matches, then put a CloudWatch alarm on that metric. (Metric filters only apply to new events.) (SOA-C02, DVA-C02)
How do you query terabytes of logs ad-hoc without exporting them? CloudWatch Logs Insights — a query language (fields/filter/parse/stats/sort) billed per GB scanned, so narrow the time range and log groups. (SOA-C02, DVA-C02)
What is the relationship between EventBridge and CloudWatch Events? They are the same service; EventBridge is the current name and superset (custom buses, SaaS partners, schema registry, Pipes, Scheduler). APIs are compatible. (DVA-C02, SAA-C03)
How would you auto-remediate a non-compliant resource? AWS Config rule detects NON_COMPLIANT → attach an SSM Automation remediation, or route the Config event through EventBridge to a Lambda/SSM action, and notify via SNS. (SCS-C02, SOA-C02)
You need a tamper-evident, multi-account audit log retained for years — what do you build? A multi-Region organisation CloudTrail with log-file validation, delivered to a dedicated log-archive account S3 bucket with Block Public Access and Object Lock; query with CloudTrail Lake or Athena. (SCS-C02)
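
The composite-alarm answer above reduces to boolean logic over child alarm states. A small Python sketch of a rule in the spirit of `ALARM("cpu-high") AND ALARM("latency-high") AND NOT ALARM("deploy-in-progress")`; the alarm names and the `page_on_call` helper are hypothetical:

```python
def in_alarm(name: str, states: dict) -> bool:
    """True when the named child alarm is currently in the ALARM state."""
    return states.get(name) == "ALARM"


def page_on_call(states: dict) -> bool:
    """Page only when both signals agree and no deploy is suppressing them."""
    return (
        in_alarm("cpu-high", states)
        and in_alarm("latency-high", states)
        and not in_alarm("deploy-in-progress", states)
    )


cpu_only = {"cpu-high": "ALARM", "latency-high": "OK", "deploy-in-progress": "OK"}
both = {"cpu-high": "ALARM", "latency-high": "ALARM", "deploy-in-progress": "OK"}
during_deploy = {"cpu-high": "ALARM", "latency-high": "ALARM", "deploy-in-progress": "ALARM"}

print(page_on_call(cpu_only), page_on_call(both), page_on_call(during_deploy))
# False True False
```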

Quick check

Which service answers “who deleted this resource”?
What does “datapoints to alarm = 3, evaluation periods = 5” mean?
Are S3 object-level reads captured by CloudTrail by default?
What is the default retention for a new CloudWatch log group?
Which service records a timeline of a resource’s configuration and evaluates compliance rules?

Answers

CloudTrail (the who did what; for the resulting state over time you’d use AWS Config).
Alarm if 3 of the last 5 evaluation periods breach the threshold (the M-out-of-N pattern that suppresses single spikes).
No — object-level access is a data event and is off by default; you must enable S3 data events on the trail.
Never expire — which is why you should always set an explicit retention policy.
AWS Config.

Glossary

Metric — a time-ordered series of numeric data points, identified by namespace + metric name + dimensions.
Namespace — a container that groups related metrics (AWS/EC2, MyApp/Checkout).
Dimension — a name/value pair scoping a metric to a resource; part of the metric’s identity.
High-resolution metric — a custom metric at 1-second granularity (vs standard 1-minute).
Alarm — a watcher on a metric/expression with OK / ALARM / INSUFFICIENT_DATA states and actions.
Composite alarm — an alarm whose state is a boolean expression over other alarms (for noise reduction).
M-out-of-N — alarm only when M of the last N evaluation periods breach (suppresses single-spike flapping).
Metric filter — a pattern that turns matching log events into a CloudWatch metric.
Subscription filter — streams matching log events in near real time to Kinesis/Firehose/Lambda.
Logs Insights — CloudWatch’s interactive query language over log data, billed per GB scanned.
CloudWatch agent — an in-OS binary that collects memory/disk metrics and ships log files.
CloudTrail trail — a config that delivers API-activity events to S3 (and optionally CloudWatch Logs/EventBridge).
Management / data / Insights events — control-plane (default) / high-volume data-plane (opt-in) / anomaly (opt-in) CloudTrail event categories.
Configuration item (CI) — AWS Config’s point-in-time snapshot of a resource’s state and relationships.
Config rule — a desired-state check marking resources COMPLIANT or NON_COMPLIANT.
Conformance pack — a deployable bundle of Config rules + remediation for a compliance standard.
EventBridge — the serverless event bus (formerly CloudWatch Events) routing events to targets.
Event pattern — the JSON matcher on an EventBridge rule.
X-Ray segment/trace — the unit of work for one service / the full path of one request across services.

Next steps

Extend monitoring to the frontend and SLOs — real-user monitoring, canaries and synthetic checks — in CloudWatch RUM, Synthetics & Canaries for Frontend SLO Monitoring.
Build the central log pipeline — subscription filters to Firehose to OpenSearch — in Structured Logging Pipeline on AWS.
Go deeper on event-driven automation in EventBridge Event-Driven Architecture: Buses, Schema & Pipes.
Add distributed tracing with AWS X-Ray: Service Map, Segments & ADOT Tracing on EKS.
Put it to work under pressure with the AWS Troubleshooting Methodology for EC2, VPC, IAM, S3 & Lambda.