Advanced AWS Troubleshooting: Complex Multi-Service Incidents & Root-Cause Analysis

A single-service problem is a puzzle; a multi-service incident is a crime scene. When an EC2 instance will not accept SSH you can reason about it in isolation — security group, key, subnet, status checks — and the previous lesson gave you those playbooks. But production rarely fails that politely. Real incidents arrive as a payments API at 30% error rate, and the cause turns out to be a DynamoDB table that started throttling because a deploy three hours ago doubled its read pattern, which exhausted the connection pool in a downstream Lambda, which tripped a circuit breaker in the upstream service, which is why the checkout page — three hops away — is timing out. Nobody changed checkout. The symptom and the cause are in different services, owned by different teams, and the only thing that ties them together is a trace ID and a timeline.

This lesson is about that kind of incident. The skill is no longer “do I know this service” — it is correlation under pressure: imposing a lifecycle on the chaos, reading three or four observability tools against one clock, forming hypotheses you can cheaply disprove, and writing the whole thing down afterwards so it never recurs. We will walk the incident-response lifecycle, the correlation toolkit (CloudWatch metrics and Logs Insights, CloudTrail, X-Ray, the Health Dashboard, Trusted Advisor), five fully worked complex scenarios as symptom → hypotheses → cross-service diagnosis → fix, the role of Service Quotas, and the blameless postmortem (Amazon’s COE) that closes the loop. This is the SOA-C02 and SAP-C02 operational-excellence mindset, written the way an on-call architect actually works it.

Because this is a reference you will return to mid-incident, the playbook itself, the error codes, the golden-signal shapes, the quota ceilings and the scenario fingerprints are all laid out as scannable tables — read the prose once, then keep the tables open at 02:14. By the end you will stop staring at the failing service. You will localise a symptom to a hop on the request path, classify the failure shape in two minutes, and walk to the cause even when it lives in a service you did not touch.

What problem this solves

In a monolith, the failing function and the failing user are in the same process; a stack trace points at the line. In a distributed system the request fans across CloudFront, an ALB, several compute tiers, a queue, a database and three AWS-managed control planes — and the failure you see is almost never co-located with the failure that caused it. The 5xx surfaces in checkout; the throttle is on one DynamoDB partition; the trigger is a feature flag on a third service. Without a method, three engineers stare at the checkout dashboard while nobody talks to the customer, nobody mitigates, and nobody writes down what actually happened.

What breaks without this discipline: incidents run long because investigation has tunnel vision, mitigations are skipped in a rush to “understand it first,” signals get compared across mismatched clocks so people prove the wrong thing, and the postmortem names a person instead of the systemic gap — so the same outage recurs next quarter. The cost is measured in revenue per minute of customer impact and in the slow erosion of trust when the team cannot say why the last outage happened.

Who hits this: anyone running a real distributed system on AWS, including SREs, on-call engineers, platform teams, and the architects who design the blast radius. It bites hardest on teams without instrumentation for correlation (no request/trace IDs in logs, X-Ray off the critical path, no quota alarms), on multi-account organisations where an SCP edit in the management account can deny actions everywhere, and on cost-sensitive stacks that run in a single AZ or set Lambda reserved concurrency too low. The fix is almost never “add more compute”; it is “impose a lifecycle, read four tools on one clock, and find the hop that is lying.”

To frame the whole field before the deep dive, here is every complex-incident class this lesson covers, the golden-signal shape that fingerprints it, the first tool to open, and the single most common root cause:

Incident class	Golden-signal shape	First question	First tool to open	Most common single cause
Cascading throttle (S1)	Latency up first, then errors, traffic flat	Is a downstream slow or rejecting?	X-Ray service map	A flag flipped `Query`→`Scan`, hot-partition throttle
AZ impairment (S2)	~1/3 errors, 2/3 fine, persistent	One AZ or one bad host?	ALB metrics BY AZ + PHD	A single Availability Zone degraded
Quota / limit (S3)	Errors track traffic, vanish at rest	Is this a bug or a ceiling?	Service Quotas + Throttle metrics	Concurrency / API-rate / ENI quota hit
IAM / SCP blast radius (S4)	`AccessDenied` across many services, one minute	Which policy layer denies?	CloudTrail (mgmt account)	An SCP/KMS/role edit with a wide reach
DNS / cert (S5)	Dashboards green, users can’t load	Is the failure upstream of me?	`dig +trace` / `openssl s_client`	ACM renewal lapsed (validation CNAME gone)

Learning objectives

By the end of this lesson you will be able to:

Run an incident through a disciplined lifecycle — detect, triage, communicate, mitigate, perform root-cause analysis (RCA), prevent — and know what “done” means at each stage.
Correlate CloudWatch metrics, CloudWatch Logs Insights queries, CloudTrail events and X-Ray traces against a single UTC timeline to locate a cause that is not in the failing service.
Read the four golden signals (latency, traffic, errors, saturation) as shapes that classify a failure before you open a single log line.
Diagnose and fix five classes of complex incident: cascading failure from a throttled dependency, Availability Zone (AZ) impairment, a service-quota breach under load, an IAM/SCP change with a wide blast radius, and a Route 53/DNS or ACM certificate failure.
Tell the difference between a mitigation (stop the bleeding now) and a fix (remove the cause) and sequence them correctly.
Use Service Quotas, Trusted Advisor and the AWS Health Dashboard proactively so the next breach is a CloudWatch alarm, not an outage.
Write a blameless Correction of Error (COE) with a real five-whys, contributing factors and CAPA action items.

Prerequisites & where this fits

You should be comfortable with the single-service troubleshooting method and playbooks from AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda — reproduce, isolate the layer, check config against desired state, inspect CloudWatch and CloudTrail, hypothesise, fix, verify, prevent. You should know IAM policy evaluation (explicit deny beats allow beats implicit deny) from IAM Fundamentals: Users, Roles, Policies & Evaluation, how VPC routing and security groups work, and roughly what CloudWatch, CloudTrail and X-Ray each record — the depth is in CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing.

This lesson sits in the Troubleshooting & operations module of the Zero-to-Hero track, immediately after the single-service playbooks and before the architecting ladder. It is mapped to SOA-C02 (AWS Certified SysOps Administrator) and the operational sections of SAP-C02 (Solutions Architect Professional). The assumed knowledge, and where to brush it up if it’s rusty:

You should know…	Why it matters here	Brush up with
Single-service troubleshooting method	Cross-service RCA builds on per-layer isolation	Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda
IAM policy evaluation (explicit deny wins)	Scenario 4 walks the evaluation chain	IAM Fundamentals: Users, Roles, Policies & Evaluation
What CloudWatch / CloudTrail record	They are two of your four correlation tools	CloudWatch & CloudTrail Observability Deep Dive
Distributed tracing basics	The X-Ray service map finds the fault edge	X-Ray: Service Map, Segments & ADOT Tracing
VPC routing, security groups, AZs	AZ impairment and connectivity reasoning	VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints

A quick map of who confirms what during a cross-service incident, so you page the right owner fast:

Layer / plane	What lives here	Who usually owns it	Failure classes it can cause
Edge / DNS (Route 53, CloudFront, ACM)	Resolution, TLS, edge routing	Frontend / SRE	DNS misroute, expired cert (S5) — dashboards stay green
Ingress (ALB/NLB, WAF)	L7 routing, health, AZ spread	Network team	Per-AZ 5xx (S2), WAF 403, listener/cert errors
Compute (ECS/EKS, Lambda, EC2)	Your services, pools, concurrency	App / platform team	Cascades (S1), concurrency throttles (S3), crashes
Data (DynamoDB, RDS, ElastiCache)	Stateful tier, capacity	Data / platform team	Throttles (S1), AZ-pinned primary (S2), saturation
Identity / policy (IAM, SCP, KMS)	Permissions above the app	Security / cloud platform	Wide `AccessDenied` blast radius (S4)
Control / correlation (CloudWatch, CloudTrail, X-Ray)	The evidence plane	SRE / everyone	Not a cause — your instrument; blind spots mislead

Core concepts

Five mental models make every later diagnosis obvious.

The symptom and the cause live in different services. The defining property of a multi-service incident is non-locality: the 5xx is in checkout, the throttle is on a DynamoDB partition, the trigger is a flag on a third service. So you never trust the failing service as the suspect. You start from the request path, find the failing edge (X-Ray), and walk toward the cause. “Where it hurts” and “what’s wrong” are different questions.

You read four tools against one clock. CloudWatch metrics tell you what changed and when; Logs Insights tells you what happened in a service; CloudTrail tells you who touched the control plane; X-Ray tells you where in the call graph the time and errors go. None of them is sufficient alone, and they only correlate if you put every console and query in UTC with the same time window. The single most common diagnostic error is comparing a log in local time against a metric in UTC and “proving” the wrong thing.

The golden-signal shape classifies the failure before you read a log. Latency, traffic, errors and saturation are the four golden signals, and their relative movement is a fingerprint. Errors up, latency flat, traffic flat is a fast-rejecting dependency. Latency up first, then errors, traffic flat is saturation or a slow dependency. Traffic up, then latency/errors up is load hitting a ceiling. A step change at an exact timestamp means something changed — go to CloudTrail for that minute. Reading the shape buys you the right hypothesis in the first two minutes.

Mitigation and fix are different jobs done in a fixed order. A mitigation restores service now (roll back, fail over, raise a quota, shed load, flip a flag); a fix removes the cause (correct the access pattern, repair the policy, restore ACM validation). During customer impact you mitigate first — the telemetry keeps the evidence — then diagnose and fix from a calm seat. Trying to find root cause before acting is the classic junior mistake that turns a five-minute incident into a two-hour one.

Most incidents are preventable hygiene, not fate. Three of the five scenarios below (quota, AZ, DNS/cert) are converted from outages into Tuesday-afternoon tickets by present-day alarms: a quota alarm at 80%, a per-AZ healthy-host alarm, a DaysToExpiry alarm. A fourth (IAM/SCP) is caught within minutes rather than hours by a CloudTrail alarm on sensitive policy events. “Convert future incidents into present-day alarms” is the whole prevention thesis.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to multi-service RCA
Incident Commander (IC)	The single coordinator; decides, assigns	Org / on-call rotation	One brain prevents three-theory thrash
Mitigation	Restore service now (rollback/failover/quota)	The incident bridge	Stops the bleeding before you understand it
Fix	Remove the underlying cause	After mitigation	The thing that stops recurrence
RCA	Disciplined search for the systemic cause	Post-mitigation	Finds cause, not just trigger
Golden signals	Latency, traffic, errors, saturation	CloudWatch	Their shape classifies the failure
Correlation	Lining up signals on one UTC clock	All four tools	The core multi-service move
Blast radius	Breadth of impact of one change	IAM/SCP/KMS/DNS	Widest for policy edits
Cascading failure	Small hiccup amplified by retries	Service-to-service edges	Turns a blip into an outage
Service Quota	An account/Region limit (rate, count)	Service Quotas console	A capacity input, not an incident — if monitored
AZ impairment	A fault confined to one AZ	One Availability Zone	Survivable with cross-AZ headroom
COE	Amazon’s blameless postmortem format	After the incident	Closes the loop with CAPA items
CAPA	Owned, dated corrective/preventive actions	Tracker	An action without an owner is a wish

The incident-response lifecycle

Under pressure, ad-hoc investigation produces tunnel vision: three engineers staring at the same dashboard while nobody talks to the customer or writes anything down. A lifecycle is the antidote. It is not bureaucracy — it is the set of jobs that must happen concurrently, and naming them lets you assign them.

Phase	Goal	Key actions	Done when
Detect	Know there is an incident, fast	Alarms (CloudWatch composite + SLO burn-rate), synthetic canaries, customer reports, Health Dashboard events	An incident is declared with a severity and a single owner
Triage	Size the blast radius and assign roles	Confirm scope (one customer / one AZ / one Region / global), set severity, name an Incident Commander (IC), a comms lead and an ops lead	Everyone knows their role and the severity is agreed
Communicate	Keep stakeholders ahead of the rumour	Status page / internal channel updates on a cadence (e.g. every 30 min for Sev-1), business-impact statement	Stakeholders are updated and the cadence is running
Mitigate	Stop customer pain before you understand it fully	Roll back, fail over, raise a quota, shed load, disable a feature flag — anything that restores service	Customer impact is gone or sharply reduced
RCA	Find the true cause, not the trigger	Correlate logs/metrics/traces/CloudTrail on one timeline, five-whys	The causal chain is written down and agreed
Prevent	Make recurrence impossible or detectable	CAPA action items with owners and dates, new alarms, guardrails, runbook updates	Action items are tracked to completion

Two principles separate seniors from juniors here. First: mitigate before you fully diagnose. The instinct to find the cause before acting is wrong during customer impact — if a rollback or failover restores service, do it, then investigate from a calm seat. The cause is still in the telemetry. Second: one Incident Commander. The IC does not debug; they coordinate, decide, and protect the responders from interruptions. The fastest incidents I have run had an IC who never touched a keyboard.

The three incident roles, and the trap when one person tries to wear two of them:

Role	Owns	Explicitly does NOT	Failure mode if merged
Incident Commander	Decisions, role assignment, severity, the bridge	Debug, type commands	If the IC debugs, nobody coordinates → tunnel vision
Comms lead	Status page, exec/stakeholder updates on cadence	Diagnose, decide mitigations	If merged with IC, comms slip while debugging
Ops lead (responder)	Run the diagnosis and mitigations	Decide scope/severity, talk to execs	If responders self-coordinate, three theories, no progress

A note on severity: pick a simple scale and stick to it. Severity drives the comms cadence and who gets paged — it is a routing decision, not a judgement of blame.

Severity	Definition	Page / staffing	Comms cadence	Example
Sev-1	Major customer-facing outage	All hands, IC + exec comms	Every 30 min	Checkout down org-wide
Sev-2	Significant degradation or single-AZ/feature impact	On-call + IC	Every 60 min	One AZ impaired, N+1 absorbs it
Sev-3	Minor or internal-only impact	On-call engineer	At resolution	A non-critical dashboard stale
Sev-4	Cosmetic / no customer impact	Backlog ticket	None	A typo in an internal runbook

The mitigate-then-fix sequencing is the move juniors get wrong most. Mitigations are fast and reversible and stop pain; fixes are slower and remove the cause. The same incident usually has both, and you do them in that order:

Incident class	Mitigation (now, reversible)	Fix (later, removes cause)	Why the order matters
Throttle cascade (S1)	Flip the flag off; table → on-demand	Restore Query/GSI; add circuit breaker	Flag-off stops bleeding in seconds; code fix takes a sprint
AZ impairment (S2)	Fail traffic away from the AZ	Add cross-AZ headroom; test failover	You can’t “fix” an AZ — AWS does; you fix your tolerance
Quota breach (S3)	Request increase; shed/queue load	Right-size reservations; alarm at 80%	Increase is minutes; capacity planning is ongoing
IAM/SCP (S4)	Revert the policy edit	Re-apply the intent narrowly via pipeline	Revert restores access now; correct scoping needs review
DNS / cert (S5)	Revert record; deploy a valid cert	Restore ACM auto-renew; records-as-code	Cert/record swap is immediate; renewal hygiene is durable
Deploy regression	Roll back to the last good version	Fix forward and re-deploy through CI	Rollback is the fastest path to green

The correlation toolkit: reading four tools against one clock

The defining move of multi-service RCA is correlation. The cause lives in a different service from the symptom, so you must line up signals from several tools on a single UTC timeline. Configure every console and query to UTC during an incident; otherwise a log line in local time and a metric datapoint in UTC that look simultaneous can be hours apart.
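As a minimal sketch of putting two signals on one clock, the helper below normalises an ISO 8601 timestamp to UTC and refuses timestamps that carry no offset, since those are the ones that get compared wrongly. The function name `to_utc` and the sample timestamps are illustrative assumptions, not part of any AWS tooling.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp and return it as an aware UTC datetime.

    Naive timestamps (no offset) are rejected instead of guessed at.
    """
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset, refusing to guess: {stamp}")
    return parsed.astimezone(timezone.utc)


# Hypothetical example: a log line written with a -04:00 offset and a metric datapoint in UTC.
log_time = to_utc("2026-06-15T10:29:40-04:00")
metric_time = to_utc("2026-06-15T14:30:00Z")
gap_seconds = (metric_time - log_time).total_seconds()
```

Here the two events are 20 seconds apart, not four hours, which is the kind of difference that decides whether a log line can explain a metric step.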

Here is what each tool is for, and crucially what it is not for:

Tool	Answers	Best for	Blind spot
CloudWatch metrics	“What changed, and when?”	Spotting the inflection point — latency, errors, throttles, saturation	Aggregates hide per-request detail; high cardinality is expensive
CloudWatch Logs Insights	“What exactly happened in this service?”	Querying structured logs across log groups for patterns, error messages, request IDs	Only as good as your logging; cross-service joins are manual
CloudTrail	“Who did what to the control plane, and when?”	Tying an incident to a config change, deploy, IAM/SCP edit, quota change	Management events only by default; data events cost extra; ~minutes of delivery latency
X-Ray (traces)	“Where in the call graph is the time/error going?”	Following one request across services; the service map shows the failing edge	Sampled by default — the one trace you want may not be captured
AWS Health Dashboard	“Is this AWS’s problem?”	Ruling in/out an AWS-side event in your account/Region (PHD shows account-specific)	Not real-time to the second; absence of an event is not proof
Trusted Advisor	“Am I near a limit or misconfigured?”	Pre-incident hygiene and quota headroom checks	Periodic, not a live diagnostic

Before the shapes, the concrete metrics you actually overlay — which service, which metric, which golden signal it represents, and the number that means trouble:

Service	Metric	Golden signal	“Bad” reading	Pairs with
ALB	`TargetResponseTime`	Latency	p99 climbing toward the LB timeout	`HTTPCode_Target_5XX_Count`
ALB	`HTTPCode_ELB_5XX_Count`	Errors	Non-zero (LB itself, not target)	`RequestCount`, target health
ALB	`HealthyHostCount` (by AZ)	Saturation	Drops in one AZ	per-AZ `5XX` concentration
Lambda	`Throttles`	Errors	Rising while `ConcurrentExecutions` flat	`Invocations`
Lambda	`ConcurrentExecutions`	Saturation	Flatlines at 1,000 / reserved	`Throttles`
DynamoDB	`ReadThrottleEvents` / `WriteThrottleEvents`	Errors	Any non-zero	`ConsumedReadCapacityUnits`
DynamoDB	`SuccessfulRequestLatency`	Latency	Spiking on one operation	throttle events
RDS	`DatabaseConnections`	Saturation	Near `max_connections`	`CPUUtilization`, `ReadLatency`
ECS / EKS	`CPUUtilization` / pool depth	Saturation	Pool exhausted, CPU pinned	upstream latency
Service Quotas	`AWS/Usage` (per quota)	Saturation	> 80% of applied value	the throttle metric it gates
API Gateway	`5XXError` / `Latency`	Errors / Latency	Spiking on one stage/route	integration (Lambda) errors
SQS	`ApproximateAgeOfOldestMessage`	Saturation	Climbing — consumer not draining	consumer `Throttles`/errors
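The "> 80% of applied value" reading in the Service Quotas row is simple arithmetic, sketched below. The function names and the sample numbers are assumptions for illustration; in practice the usage figure would come from the `AWS/Usage` metric and the limit from the quota's applied value.

```python
def quota_utilisation(usage: float, applied_limit: float) -> float:
    """Fraction of the applied quota currently in use."""
    if applied_limit <= 0:
        raise ValueError("applied_limit must be positive")
    return usage / applied_limit


def near_ceiling(usage: float, applied_limit: float, threshold: float = 0.80) -> bool:
    """True when utilisation is above the alarm threshold (80% by default)."""
    return quota_utilisation(usage, applied_limit) > threshold


# Hypothetical reading: 850 concurrent executions against an applied limit of 1,000.
lambda_alarm = near_ceiling(850, 1000)
```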

The high-leverage skill is the golden-signal sweep: in the first two minutes, look at latency, traffic, errors and saturation (the four golden signals) for the affected service and its immediate dependencies, all on the same time window. The shape tells you the class of problem before you read a single log line. Here is the full shape-to-class map you scan first:

Latency	Traffic	Errors	Saturation	It’s probably…	Go straight to
Flat	Flat	Up	Flat	A dependency rejecting fast (throttle/4xx/bad deploy)	The dependency’s throttle metrics
Up first	Flat	Up after	Rising	Saturation or a slow dependency; queue fills, then timeouts	X-Ray service map, then pool/queue depth
Up	Up	Up	Rising	Load-driven — a capacity or quota ceiling	Service Quotas + concurrency/throttle metrics
Step	Flat	Step	Flat	Something changed at an exact minute	CloudTrail `lookup-events` for that minute
Flat	Flat	Up (one third)	Flat in 2 AZs	Single-AZ impairment	ALB metrics BY AZ-ID + PHD
Flat	Flat	Flat (your view)	Flat	Failure is upstream (DNS/cert/edge)	`dig +trace`, `openssl s_client`
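The shape-to-class table above can be read as a plain lookup. The sketch below encodes its six rows; the trend labels (`"flat"`, `"up_first"`, `"up_one_third"` and so on) are this example's own shorthand, and a real incident will often need a human to decide which label fits a graph.

```python
# Keys are (latency, traffic, errors, saturation) trends, in the table's column order.
SHAPES = {
    ("flat", "flat", "up", "flat"): (
        "dependency rejecting fast", "the dependency's throttle metrics"),
    ("up_first", "flat", "up_after", "rising"): (
        "saturation or slow dependency", "X-Ray service map, then pool/queue depth"),
    ("up", "up", "up", "rising"): (
        "load hitting a capacity or quota ceiling", "Service Quotas + throttle metrics"),
    ("step", "flat", "step", "flat"): (
        "change at an exact minute", "CloudTrail lookup-events for that minute"),
    ("flat", "flat", "up_one_third", "flat"): (
        "single-AZ impairment", "ALB metrics by AZ ID + PHD"),
    ("flat", "flat", "flat", "flat"): (
        "failure upstream (DNS/cert/edge)", "dig +trace, openssl s_client"),
}


def classify(latency: str, traffic: str, errors: str, saturation: str) -> tuple[str, str]:
    """Return (probable class, where to look first) for a golden-signal shape."""
    key = (latency, traffic, errors, saturation)
    return SHAPES.get(key, ("unclassified", "widen the window and re-read all four signals"))


probable_class, first_stop = classify("up_first", "flat", "up_after", "rising")
```

An unmatched shape returns "unclassified" rather than the nearest row, so the lookup never suggests a hypothesis the table does not support.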

A reading note that saves the most time — the difference between who emitted an error decides which logs you open:

Question	The trap	How to tell
Did my service return the 5xx, or a layer in front?	An hour in the wrong logs	If X-Ray/logs show the request succeeding (slowly) but the client got a 504, the ALB/CloudFront emitted it on timeout; a 502 from the same layer means the target connection dropped or the response was malformed
Is the throttle in my app or in AWS?	“Scale up” masks it	An AWS throttle names the service in the error (`ThrottlingException`, `ProvisionedThroughputExceededException`); an app limit does not
Did a human change something, or did infra fail?	Chasing a config ghost	CloudTrail quiet at the inflection minute is a strong signal the cause is infrastructure (AZ), not a change

Logs Insights and CloudTrail: the two queries you will type most

When errors spike, this Logs Insights query finds the dominant failure mode fast: bin by minute, count by error type, and you see both the onset and the breakdown. It assumes your structured logs carry `status` and `errorType` fields:

fields @timestamp, @message
| filter status >= 500 or @message like /Throttl|Timeout|ProvisionedThroughputExceeded/
| stats count(*) as errors by bin(1m), errorType
| sort errors desc
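If you would rather run that query from a script than from the console, a hedged sketch with boto3 follows. `start_query` and `get_query_results` are the CloudWatch Logs API calls; the log group name, the window widths and the helper names are assumptions for illustration. The window builder is pure, and boto3 is imported only inside `run_query`, so the first function can be exercised without AWS access.

```python
import time
from datetime import datetime, timedelta, timezone

QUERY = """fields @timestamp, @message
| filter status >= 500 or @message like /Throttl|Timeout|ProvisionedThroughputExceeded/
| stats count(*) as errors by bin(1m), errorType
| sort errors desc"""


def build_query_args(log_groups, inflection_utc, minutes_before=10, minutes_after=10):
    """Build start_query arguments for a UTC window around the inflection minute."""
    if inflection_utc.tzinfo is None:
        raise ValueError("inflection_utc must be timezone-aware (UTC)")
    start = inflection_utc - timedelta(minutes=minutes_before)
    end = inflection_utc + timedelta(minutes=minutes_after)
    return {
        "logGroupNames": list(log_groups),
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": QUERY,
    }


def run_query(args, poll_seconds=1.0):
    """Start the query and poll until it leaves the Scheduled/Running states."""
    import boto3  # imported lazily; requires credentials and a configured Region

    logs = boto3.client("logs")
    query_id = logs.start_query(**args)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


# Hypothetical log group and inflection minute.
args = build_query_args(
    ["/ecs/order-service"], datetime(2026, 6, 15, 14, 30, tzinfo=timezone.utc)
)
```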

And when a metric shows a step change at a specific minute, this CloudTrail lookup answers “what changed”:

# Who/what touched the control plane around the inflection point (UTC)
aws cloudtrail lookup-events \
  --start-time 2026-06-15T14:25:00Z \
  --end-time   2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName,Src:EventSource}' \
  --output table

Pair a step change in a metric with a CloudTrail event in the same minute and you have usually found your trigger. For request-level correlation, propagate a request ID (and the X-Ray trace ID) through your logs so you can pivot from a metric, to the traces on that edge, to the exact log lines for that request. The exact CLI / console moves for each correlation step:

Step	Tool	Exact command / path	What it proves
Find the inflection minute	CloudWatch	Metrics → set timezone UTC → note the step	When it started
See the failing edge	X-Ray	Service map → click the red edge → trace list	Where in the call graph
Break down the errors	Logs Insights	the `stats count(*) by bin(1m), errorType` query	The dominant failure mode
Tie to a change	CloudTrail	`aws cloudtrail lookup-events --start-time … --end-time …`	Who/what changed at that minute
Confirm AWS-side	Health Dashboard	PHD → events for your account/Region	Whether it’s AWS, not you
Pivot to one request	Logs + X-Ray	filter logs by `reqId` / `trace_id`	The exact request’s path and lines
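The first and fourth steps of the table (find the inflection minute, then tie it to a change) can be sketched as two small functions. This is an illustrative heuristic, not a CloudWatch feature: the threshold factor, the baseline length and the sample data are assumptions. The event dictionaries mirror the `EventTime` and `EventName` fields that CloudTrail `lookup-events` returns.

```python
from datetime import datetime, timedelta, timezone


def find_step_minute(series, factor=3.0, baseline_points=5):
    """Return the first timestamp whose value exceeds factor x the mean of the
    preceding baseline_points values, or None if there is no such step.

    series is a list of (aware UTC datetime, value) pairs, one per minute, ascending.
    """
    for i in range(baseline_points, len(series)):
        window = [value for _, value in series[i - baseline_points:i]]
        baseline = sum(window) / len(window)
        if series[i][1] > factor * max(baseline, 1e-9):
            return series[i][0]
    return None


def events_near(events, when, window=timedelta(minutes=5)):
    """Keep control-plane events whose EventTime is within the window of the step."""
    return [event for event in events if abs(event["EventTime"] - when) <= window]


# Hypothetical per-minute 5xx counts starting at 14:20 UTC.
start = datetime(2026, 6, 15, 14, 20, tzinfo=timezone.utc)
counts = [2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 40, 45]
series = [(start + timedelta(minutes=i), count) for i, count in enumerate(counts)]
step = find_step_minute(series)

# Hypothetical CloudTrail events.
events = [
    {"EventName": "UpdateFunctionConfiguration",
     "EventTime": datetime(2026, 6, 15, 14, 28, 10, tzinfo=timezone.utc)},
    {"EventName": "CreateUser",
     "EventTime": datetime(2026, 6, 15, 13, 0, 0, tzinfo=timezone.utc)},
]
suspects = events_near(events, step)
```

An empty `suspects` list is itself evidence: a quiet control plane at the inflection minute points toward infrastructure rather than a change.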

The error & throttle reference

Before the worked scenarios, here is the lookup table you scan when an error string appears: the codes and exceptions you realistically see across services during a cross-service incident, what each means, the likely cause, how to confirm it, and the first fix. The non-obvious ones are the throttle exceptions (they name the service and scale with load) and the difference between a 502 the ALB emitted and a 5xx your app emitted.

Code / exception	Where it appears	Likely cause	How to confirm	First fix
`ProvisionedThroughputExceededException`	DynamoDB SDK / logs	Read/write past table or partition capacity	`ReadThrottleEvents`/`WriteThrottleEvents` non-zero; hot partition	On-demand or raise capacity; fix the access pattern
`ThrottlingException` / `Rate exceeded`	Many AWS APIs	Account/API request-rate quota hit under load	Service Quotas usage metric vs applied value	Backoff+jitter; request quota increase
`TooManyRequestsException` (Lambda 429)	Lambda invoke	Concurrency limit (account 1,000 or reserved)	`Throttles` up while `ConcurrentExecutions` flat	Raise reserved concurrency; queue with SQS
`502 Bad Gateway`	ALB / CloudFront	Target closed the connection or gave no/bad answer (a pure timeout surfaces as 504)	App trace shows the request handled, yet the client got 502 → LB emitted it	Keep target keep-alive above the LB idle timeout; fix health
`503 Service Unavailable`	ALB	No healthy target in the target group	Target group `HealthyHostCount` = 0 (maybe one AZ)	Restore healthy targets; check per-AZ
`504 Gateway Timeout`	ALB / API Gateway	Backend slower than the LB/integration timeout	Backend p99 climbing toward the timeout value	Speed up backend; raise timeout to match
`AccessDenied`	S3/KMS/STS/most APIs	IAM/SCP/resource-policy/KMS-key-policy deny	CloudTrail `errorCode: AccessDenied`; trace the eval chain	Revert the offending policy edit
`AccessDeniedException` (KMS)	KMS-backed ops	Key policy/grant removed `Decrypt`/`GenerateDataKey`	CloudTrail `PutKeyPolicy`/`RevokeGrant`	Restore the key-policy statement/grant
`SERVFAIL` / empty answer	`dig` output	DNS broken: bad record, delegation, health-check flip	`dig +trace`, `dig @8.8.8.8`	Revert `ChangeResourceRecordSets`; fix NS
`ERR_CERT_DATE_INVALID`	Browser / client	Expired or wrong cert served at the edge	`openssl s_client` `notAfter`/subject	Deploy valid cert; restore ACM validation
`5xx` from your app	App logs / X-Ray	A real runtime exception in a handler	Logs Insights `filter status >= 500`; stack trace	Fix the throwing code path
`ResourceLimitExceeded` / `LimitExceeded`	EC2/ENI/EIP APIs	A resource-count quota (ENIs, EIPs, instances)	Service Quotas applied value; Trusted Advisor limits	Request increase; right-size; clean up leaks
`RequestLimitExceeded`	EC2 control plane	API call-rate throttling (describe storms)	CloudTrail volume; SDK retry logs	Backoff+jitter; cache describes; spread calls
`InternalError` / `5xx`	Any AWS API	Transient AWS-side error	PHD; retry succeeds	Retry with backoff; it’s usually self-healing
`ConditionalCheckFailedException`	DynamoDB	Optimistic-lock condition not met (often expected)	Is it in normal write paths?	Usually benign; only chase if it spikes

Three reading notes that save the most time:

Distinction	The trap	How to tell them apart
LB-emitted 502/504 vs app-emitted 5xx	Hours wasted in the wrong logs	If X-Ray shows the request succeeding (slowly) but the client got a 504, the load balancer emitted it on timeout; an LB 502 means the target dropped the connection or sent a malformed response
Throttle vs application bug	“Scale up” masks the real ceiling	A throttle names the service and tracks traffic (vanishes at rest); a bug does not scale with load
`AccessDenied` from IAM vs SCP/KMS	Editing the wrong policy layer	Many services denied at one minute points above the app — SCP (mgmt account) or KMS key policy, not per-service IAM
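The "throttle tracks traffic, a bug does not" distinction can be checked numerically rather than by eye. The sketch below computes a Pearson correlation between per-interval traffic and error counts; the 0.8 threshold and the sample series are assumptions, and a high correlation is a hint to check quotas first, not proof of a ceiling.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend to correlate
    return cov / sqrt(var_x * var_y)


def looks_like_ceiling(traffic, errors, threshold=0.8):
    """True when errors rise and fall with traffic, the fingerprint of a throttle or quota."""
    return pearson(traffic, errors) >= threshold


# Hypothetical per-interval request counts and two error series.
traffic = [100, 200, 400, 800, 400, 100]
throttle_errors = [0, 0, 20, 120, 20, 0]    # vanish at rest, spike at peak
bug_errors = [15, 14, 16, 15, 14, 16]       # steady regardless of load
```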

The five worked scenarios that follow are the five complex-incident classes you will actually meet. Before the detail, here they are side by side on the dimensions that distinguish them — so mid-incident you can match the fingerprint to the scenario and jump to the right playbook:

Scenario	Distinguishing fingerprint	Confirming tool	CloudTrail at the minute	Mitigation	Durable fix
S1 — Throttle cascade	Latency up first, then 5xx, traffic flat	X-Ray red edge + `ReadThrottleEvents`	A deploy/flag on a third service	Kill the flag; on-demand; backoff	Restore Query/GSI; circuit breaker; bulkheads
S2 — AZ impairment	~1/3 fail, 2/3 fine, persistent	ALB `HealthyHostCount` by AZ + PHD	Quiet (infra, not a change)	Fail away from the AZ	Cross-AZ N+1; tested Multi-AZ failover
S3 — Quota / limit	Errors track traffic, vanish at rest	Service Quotas + `Throttles`	Maybe a lowered reservation	Raise quota; queue; shed load	Right-size concurrency; alarm at 80%
S4 — IAM/SCP blast radius	Many services `AccessDenied`, one minute	CloudTrail in mgmt account	The policy edit, exactly	Revert the policy change	Policy-as-code with review pipeline
S5 — DNS / cert	Dashboards green, users can’t load	`dig +trace` / `openssl s_client`	`ChangeResourceRecordSets` or none	Revert record / deploy valid cert	ACM auto-renew; record-as-code; canary

Worked scenario 1 — Cascading failure from a throttled dependency

Symptom. The checkout API’s p99 latency climbs from 200 ms to 8 s over ten minutes, then it starts returning 5xx. Traffic is flat — no marketing spike. The on-call for checkout swears they deployed nothing.

Hypotheses.

Checkout itself regressed (deploy, memory leak, GC). Unlikely — no deploy, and latency rose before errors.
A downstream dependency slowed or started rejecting, and checkout’s threads/connections are now blocked waiting on it (back-pressure cascade).
A shared resource (a database, a connection pool, a Lambda concurrency limit) is saturated and several services are contending for it.

Cross-service diagnosis. The golden-signal shape (latency up first, then errors, traffic flat) points away from load and towards a slow or rejecting dependency. Open the X-Ray service map for the checkout request: it shows checkout → order-service → a DynamoDB table, and the order-service → DynamoDB edge is red with high latency and a fault rate. Pivot to CloudWatch and overlay the table’s ReadThrottleEvents (or ThrottledRequests) metric: it went non-zero exactly when checkout’s latency began to climb. Now the question is: why is the table throttling on flat traffic? CloudTrail lookup-events for the preceding hour shows a deploy on a different service, and a Logs Insights query on order-service shows it is doing a Scan where it used to Query because a feature flag flipped. A Scan reads every item in the table rather than one key’s items, so it blows the table’s provisioned (or hot-partition) capacity. DynamoDB throttles, order-service retries (amplifying load), its connection pool fills, and checkout, waiting on order-service, backs up and finally times out. Classic cascade: the symptom is in checkout, the trigger is a flag on a third service, the bottleneck is one DynamoDB partition.

The cascade as a hop-by-hop chain, so you can see exactly where it amplified:

Hop	What happens	Golden signal here	The amplifier
Flag flips on svc-C	`Query` → `Scan` on order-service	n/a	Reads the whole table
DynamoDB partition	Capacity blown → throttle	`ReadThrottleEvents` up	Hot-partition limit
order-service	Retries the throttled call	Latency up, then errors	Retries with no backoff multiply load
Connection pool	Fills with blocked waiters	Saturation up	No bulkhead → shared pool exhausts
checkout	Threads blocked on order-service	p99 up, then 5xx	No timeout / circuit breaker

Fix.

Mitigate now: turn off the feature flag that introduced the Scan (removes the load source instantly); if the table is provisioned, bump capacity or enable on-demand to absorb the burst; ensure clients use exponential backoff with jitter so retries stop amplifying.
Fix the cause: restore the access pattern to a Query against a proper key/GSI; add a circuit breaker in checkout so a slow dependency fails fast instead of consuming all threads; set sensible client timeouts and bounded retries; consider DAX or caching for the hot read.
Prevent: alarm on ReadThrottleEvents and on the saturation of the connection pool; add a canary that exercises checkout end to end; require load-testing of access-pattern changes; adopt bulkheads so one dependency cannot exhaust a shared pool.
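The "exponential backoff with jitter" in the mitigation can be sketched as follows. This is one common variant (capped "full jitter"), not necessarily what any given AWS SDK does by default; the function name and the default `base`, `cap` and `max_retries` values are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(base: float = 0.1, cap: float = 5.0, max_retries: int = 5,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one sleep duration (seconds) per retry using capped full jitter.

    Retry n draws uniformly from [0, min(cap, base * 2**n)], so the ceiling
    doubles each time while the randomness de-synchronises competing clients.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(max_retries)]


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {attempt}: sleep {delay:.3f}s")
```

Bounding `max_retries` matters as much as the jitter: an unbounded retry loop still amplifies load, only more slowly.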

The resilience primitives that contain a cascade, and what each one stops:

Primitive	What it does	Stops which failure	Cost / trade-off
Timeout	Caps how long a call waits	Threads pinned on a slow dependency	Too tight → false failures
Exponential backoff + jitter	Spreads and slows retries	Retry storms amplifying a throttle	Slightly higher tail latency
Circuit breaker	Fails fast when a dependency is sick	Cascade through the caller	Needs tuning of open/half-open thresholds
Bulkhead	Isolates dependencies into separate pools	One downstream exhausting a shared pool	More pools to size and monitor
Caching / DAX	Cuts read volume to the hot path	Hot-partition throttle	Staleness; cache invalidation

The lesson: retries and missing timeouts turn a small dependency hiccup into a system-wide outage. Backoff with jitter, timeouts, circuit breakers and bulkheads are the resilience primitives that contain a cascade. The access-pattern depth is in DynamoDB Deep Dive: Tables, Keys, Capacity, GSIs & Streams.
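For readers who have not built one, a circuit breaker can be sketched in a few lines. This is a deliberately minimal, single-threaded illustration under assumed thresholds; the class name, `failure_threshold` and `reset_after` are hypothetical, and a production service would normally use a maintained resilience library instead.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open breaker (single-threaded sketch)."""

    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of pinning a thread on a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

Wrapping the checkout call to order-service in `breaker.call(...)` is what turns "threads blocked until timeout" into an immediate, cheap failure while the dependency recovers.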

Worked scenario 2 — Availability Zone impairment and failover

Symptom. Error rate jumps to roughly one-third of requests and p99 latency spikes, but two-thirds of requests are perfectly healthy. The pattern is partial and persistent — refreshes sometimes succeed, sometimes fail.

Hypotheses.

A bad deploy on a subset of instances (canary gone wrong).
One AZ is impaired — networking, a backing service, or capacity in that zone.
A single unhealthy backend (one RDS replica, one cache node) serving a fraction of traffic.

Cross-service diagnosis. “One-third failing, two-thirds fine” across a three-AZ deployment is the signature of single-AZ impairment. Confirm it three ways. First, the AWS Health Dashboard (Personal Health Dashboard) — check for an account-specific event naming an AZ ID (note: AZ names like us-east-1a are randomised per account; the Health event and your metrics should be reconciled by AZ ID, e.g. use1-az2). Second, break the load balancer’s metrics down by AZ: target group HealthyHostCount has dropped in one AZ and HTTPCode_Target_5XX_Count is concentrated there. Third, check the data tier — if RDS is Multi-AZ, is the primary in the impaired zone? Is one ElastiCache shard’s primary there? CloudTrail will be quiet — this is not a change you made, which itself is a strong signal that the cause is infrastructure, not config.

The three confirmations, with the exact signal each gives:

Confirmation	Where	Exact signal	Reads as
Health event	PHD (Personal Health Dashboard)	An open event naming an AZ ID (`use1-az2`)	AWS has acknowledged the zone
Per-AZ LB metrics	CloudWatch (ALB, dimension by AZ)	`HealthyHostCount` ↓ and `HTTPCode_Target_5XX_Count` ↑ in one AZ	Your traffic confirms the zone
Data-tier primary	RDS / ElastiCache console	Multi-AZ primary or shard primary sits in the impaired AZ	The stateful blast radius
CloudTrail	CloudTrail	Quiet at the inflection minute	Not a change you made → infra
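The per-AZ reading can be reduced to a small check once the metrics are in hand. The sketch below assumes you have already pulled a healthy-host count and a target 5xx count per zone into a dictionary; the function name, the 80% `share` threshold and the sample numbers are all illustrative assumptions.

```python
from typing import Dict, List


def impaired_zones(per_az: Dict[str, Dict[str, int]], share: float = 0.8) -> List[str]:
    """Return zones that carry at least `share` of all target 5xx errors
    while also reporting fewer healthy hosts than the best zone."""
    total_5xx = sum(m["5xx"] for m in per_az.values())
    if total_5xx == 0:
        return []
    best_healthy = max(m["healthy"] for m in per_az.values())
    return [
        az for az, m in per_az.items()
        if m["5xx"] / total_5xx >= share and m["healthy"] < best_healthy
    ]


if __name__ == "__main__":
    sample = {  # made-up numbers for one minute of traffic
        "use1-az1": {"healthy": 4, "5xx": 3},
        "use1-az2": {"healthy": 1, "5xx": 410},
        "use1-az4": {"healthy": 4, "5xx": 5},
    }
    print(impaired_zones(sample))
```

If the function returns an empty list while errors are high, the failure is not zonal and the bad-deploy or single-backend hypotheses deserve the next look.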

The CLI to break ALB health down by AZ and to trigger an RDS failover:

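# Healthy host count for one target group in one AZ; repeat per AZ, and the zone with the drop is impaired.
# CloudWatch dimensions take the AZ name (us-east-1a), so map the Health event's AZ ID to your account's name first.
# HealthyHostCount also needs the LoadBalancer dimension; the app/... and targetgroup/... values are placeholders.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/checkout/abc \
               Name=LoadBalancer,Value=app/checkout-alb/def \
               Name=AvailabilityZone,Value=us-east-1a \
  --start-time 2026-06-15T14:00:00Z --end-time 2026-06-15T15:00:00Z \
  --period 60 --statistics Minimum --output table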

# If the impaired AZ holds the RDS primary, fail over Multi-AZ (promotes the standby)
aws rds reboot-db-instance --db-instance-identifier orders-prod --force-failover

Fix.

Mitigate now: fail traffic away from the bad AZ. If targets are auto-registered, the load balancer’s health checks should already be routing around unhealthy targets — verify cross-zone load balancing and that healthy capacity in the other AZs can absorb the shifted load (this is why you provision for N+1 across AZs). For data, if the impaired AZ holds an RDS primary, trigger a Multi-AZ failover; promote a healthy read replica or fail over the cache primary as needed.
Fix the cause: you do not fix an AZ — AWS does. Your job is to confirm your failover actually works and that capacity headroom in surviving AZs is real, not theoretical.
Prevent: deploy across at least three AZs with enough headroom to lose one; enable RDS Multi-AZ and test failover regularly (game days); make sure Auto Scaling is balanced across AZs and that subnets exist in each; alarm on per-AZ healthy-host count, not just the aggregate.

What the architecture must already have for an AZ loss to be a non-event:

Guardrail	Why	How to verify	What its absence causes
≥3 AZs with N+1 headroom	Survive losing one zone with room to spare	ASG spread; subnet-per-AZ	Surviving AZs overload when one fails
Cross-zone load balancing	Spread shifted load evenly	ALB attribute enabled	One AZ’s targets overload
RDS Multi-AZ + tested failover	Promote standby fast	Game-day a forced failover	Stateful tier stuck in the bad AZ
Per-AZ alarms (not just aggregate)	The aggregate hides a one-AZ drop	Alarm on `HealthyHostCount` per AZ	You learn of it from customers
Reconcile by AZ ID, not name	AZ names are randomised per account	Map `us-east-1a`↔`use1-az2`	You chase the wrong zone
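The N+1 row is simple arithmetic, and it is worth running before the incident rather than during it. A minimal sketch, assuming capacity is spread evenly across zones and measured in requests per second; the function names and the example figures are illustrative.

```python
def survives_az_loss(az_count: int, capacity_per_az_rps: float, peak_rps: float) -> bool:
    """True if the remaining zones can carry peak traffic after losing one AZ."""
    if az_count < 2:
        return False  # a single zone has nowhere to fail over to
    return (az_count - 1) * capacity_per_az_rps >= peak_rps


def capacity_needed_per_az(az_count: int, peak_rps: float) -> float:
    """Per-zone capacity required so that losing one zone still covers the peak."""
    if az_count < 2:
        raise ValueError("need at least two AZs to tolerate losing one")
    return peak_rps / (az_count - 1)


if __name__ == "__main__":
    # Hypothetical: three zones sized at 1,000 rps each against a 2,400 rps peak.
    print(survives_az_loss(3, 1000, 2400))
    print(capacity_needed_per_az(3, 2400))
```

Sizing each zone for a third of peak looks balanced on a dashboard but fails this check; each of three zones has to be able to carry half.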

The lesson: design on the assumption that an AZ will fail, and an AZ failure becomes a non-event. The incident is only severe if your architecture cannot tolerate losing one zone. The Multi-AZ depth is in RDS & Aurora Deep Dive: Engines, Multi-AZ, Replicas & Backups, and the load-balancer mechanics in Elastic Load Balancing: ALB, NLB & GWLB Deep Dive.

Worked scenario 3 — A service-quota breach under load

Symptom. During a traffic surge, a fraction of API requests fail with errors that mention Rate exceeded, LimitExceeded, or ThrottlingException — and the failures track traffic: more load, more failures, and they vanish when load drops. Nothing is “down”; the system is hitting an invisible ceiling.

Hypotheses.

An application-level rate limit or a downstream API throttle.
A Lambda concurrency or burst limit (account or function-level reserved concurrency).
An account Service Quota — API request rate, ENIs per Region, concurrent executions, EIPs, etc. — being exceeded as load scales.

Cross-service diagnosis. The “fails in proportion to load, recovers when load drops” shape is the fingerprint of a quota or limit, not a bug. Identify which ceiling. For Lambda, CloudWatch shows Throttles rising with invocations while ConcurrentExecutions flatlines at a round number (1,000 by default, or your reserved figure) — that is the concurrency limit. For API throttling, the error envelope names the service; check the Service Quotas console for that service’s “applied quota value” and compare against your CloudWatch usage metrics (many quotas now publish a usage metric you can alarm on). CloudTrail can show whether someone recently lowered a reserved concurrency or whether a new function ate the account pool. The tell that distinguishes this from scenario 1: here the errors are immediate throttles that scale with traffic, not latency-then-timeout from a slow dependency.
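The Lambda fingerprint described here (throttles appearing only while concurrency is pinned at the limit) can be checked mechanically once the two series are exported from CloudWatch. A rough sketch: the function name, the 2% `tolerance` and the sample datapoints are assumptions for illustration.

```python
from typing import Sequence


def hit_concurrency_ceiling(concurrent: Sequence[float], throttles: Sequence[float],
                            limit: float, tolerance: float = 0.02) -> bool:
    """True when every throttled datapoint coincides with concurrency pinned at the limit."""
    pinned = [c >= limit * (1 - tolerance) for c in concurrent]
    throttled = [t > 0 for t in throttles]
    overlap = sum(1 for p, t in zip(pinned, throttled) if p and t)
    # Throttling must exist, and it must occur only while pinned at the ceiling.
    return any(throttled) and overlap == sum(throttled)


if __name__ == "__main__":
    # Made-up one-minute datapoints around a surge against a 1,000 limit.
    concurrency = [420, 780, 1000, 1000, 1000, 640]
    throttle_counts = [0, 0, 37, 112, 54, 0]
    print(hit_concurrency_ceiling(concurrency, throttle_counts, limit=1000))
```

Throttles that appear while concurrency sits well below the account limit point somewhere else, typically a function-level reservation or a different quota.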

The common quota ceilings that bite under load, their default, and how each shows up:

Quota	Typical default	Metric / signal	Symptom under load	Mitigation
Lambda concurrent executions (account)	1,000	`Throttles` up, `ConcurrentExecutions` flat at 1,000	429 `TooManyRequestsException`	Raise account quota; reserved per function
Lambda reserved concurrency (function)	shared pool	`Throttles` on one function only	That function throttles, others fine	Raise/remove the too-low reservation
API request rate (per service)	service-specific	`ThrottlingException`, `Rate exceeded`	A fraction of calls 400/429	Backoff+jitter; quota increase
ENIs per Region	account-specific	`ResourceLimitExceeded` on scale-out	New tasks/ENIs fail to launch	Request increase; reduce ENI churn
Elastic IPs per Region	5	`AddressLimitExceeded`	Cannot allocate an EIP	Request increase; release unused EIPs
DynamoDB on-demand throughput	account/table	`ProvisionedThroughputExceeded` (burst)	Throttles on a sudden spike	Pre-warm; switch capacity mode

Fix.

Mitigate now: request a quota increase via the Service Quotas console or API (some are auto-approved, some go to Support — for a Sev-1, open a Support case in parallel); for Lambda, raise reserved concurrency or remove a too-low reservation that is starving the function; shed non-critical load or queue it (SQS) to flatten the spike below the ceiling.
Fix the cause: right-size reserved concurrency per function so one workload cannot starve others; put a queue in front of spiky producers so the consumer drains at a controlled rate; cache to cut call volume to the throttled API.
Prevent: this is the headline preventive — monitor quotas proactively. Use Service Quotas usage metrics and Trusted Advisor’s service-limit checks to alarm at ~80% of every quota that scales with your traffic, so the next surge is a warning, not an outage.
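The "alarm at ~80%" rule is easy to express as a check over usage and limit pairs. This sketch assumes you have already collected current usage and the applied quota value for each quota you care about; the function name, the quota labels and the sample figures are hypothetical.

```python
from typing import Dict, List, Tuple

WARN_AT = 0.80  # the ~80% threshold suggested above


def quotas_near_limit(usage: Dict[str, Tuple[float, float]],
                      warn_at: float = WARN_AT) -> List[Tuple[str, float]]:
    """Return (quota name, utilisation) for quotas at or above warn_at, worst first."""
    hot = [
        (name, used / limit)
        for name, (used, limit) in usage.items()
        if limit > 0 and used / limit >= warn_at
    ]
    return sorted(hot, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = {  # (current usage, applied quota value); made-up numbers
        "Lambda concurrent executions": (860, 1000),
        "Elastic IPs per Region": (2, 5),
        "ENIs per Region": (4900, 5000),
    }
    for name, utilisation in quotas_near_limit(sample):
        print(f"{name}: {utilisation:.0%} of quota")
```

In practice the same comparison lives in a CloudWatch alarm rather than a script; the point of the sketch is that the threshold is a ratio against the applied value, not a fixed number.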

The CLI to read an applied quota and to request an increase:

# Read the applied value for Lambda concurrent executions (quota L-B99A9384)
aws service-quotas get-service-quota \
  --service-code lambda --quota-code L-B99A9384 \
  --query 'Quota.{Name:QuotaName,Applied:Value}' --output table

# Request an increase (some auto-approve; others raise a Support case)
aws service-quotas request-service-quota-increase \
  --service-code lambda --quota-code L-B99A9384 --desired-value 3000

The lesson: quotas are a capacity-planning problem, not an incident. If you discover a quota during an outage, you have a monitoring gap, not just a limit. The concurrency mechanics are in Lambda Deep Dive: Runtimes, Triggers, Layers & Concurrency.

Worked scenario 4 — An IAM/SCP change with a wide blast radius

Symptom. At a precise timestamp, multiple unrelated services across one or several accounts start failing with AccessDenied — uploads to S3, a Lambda that can no longer write to DynamoDB, an ECS task that cannot pull from ECR. No application deploy happened. The breadth is the clue: many services, one moment, same error class.

Hypotheses.

A KMS key policy or grant was changed and everything that decrypts through it now fails (a wide but specific blast radius).
An IAM change — a shared role’s permissions, or a permission boundary tightened.
A Service Control Policy (SCP) at the Organization/OU level changed and is now denying actions org-wide regardless of IAM (SCPs set the maximum permissions; an explicit deny there wins everywhere).

Cross-service diagnosis. When AccessDenied appears across many services at the same minute, suspect a policy layer above the application, because identity-based policies are usually edited per service. Go straight to CloudTrail and look up PutBucketPolicy, PutKeyPolicy, PutRolePolicy, DeleteRolePolicy, PutRolePermissionsBoundary, and crucially the Organizations events UpdatePolicy/AttachPolicy (SCPs are recorded in the management/delegated-admin account’s CloudTrail, and because Organizations is a global service those events land in us-east-1). The timestamp of the change will line up exactly with the onset. To prove which layer is denying, take one failing call and reason through the evaluation chain: an SCP deny blocks the action no matter what IAM allows, so if the IAM policy looks correct but the call still fails org-wide, the SCP (or a permission boundary, or a resource-policy/KMS-key-policy deny) is the culprit. CloudTrail’s errorCode: AccessDenied entries, combined with knowing that explicit deny always wins, let you walk from symptom to the exact policy edit.

The policy layers, ordered by reach, with the CloudTrail event that records an edit to each:

Layer	Reach	Explicit deny here means…	CloudTrail event to look up	Where the trail lives
SCP (Organizations)	Whole OU / org	Denied everywhere, regardless of IAM	`AttachPolicy`, `UpdatePolicy`, `DetachPolicy`	Management / delegated-admin account
Resource policy (S3 bucket, etc.)	That resource	Denied on that resource for all principals	`PutBucketPolicy`, `PutRepositoryPolicy`	The resource’s account
KMS key policy / grant	Everything using the key	Decrypt/encrypt fails for the named principals	`PutKeyPolicy`, `RevokeGrant`, `CreateGrant`	The key’s account
Permission boundary	The bounded principal	Caps that role even if its policy allows	`PutRolePermissionsBoundary`, `DeleteRolePermissionsBoundary`	The principal’s account
Identity policy (IAM)	One user/role	The usual per-service permission	`PutRolePolicy`, `AttachRolePolicy`, `DeleteRolePolicy`	The principal’s account
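The evaluation chain you walk during diagnosis can be sketched as a short function. This is a heavily simplified model for a single same-account request: it ignores resource-policy allows, session policies and cross-account rules, and the `Layer` type and `evaluate` function are hypothetical names, not an AWS API. Its only purpose is to show why an explicit deny in any layer, or a missing allow in the SCP or boundary, beats a correct-looking identity policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Layer:
    name: str
    allows: bool                 # some statement in this layer allows the action
    explicit_deny: bool = False  # some statement in this layer explicitly denies it


def evaluate(scp: Layer, identity: Layer, boundary: Optional[Layer] = None,
             resource: Optional[Layer] = None) -> Tuple[str, str]:
    """Return (decision, reason) for one request under the simplified model."""
    present = [layer for layer in (scp, resource, boundary, identity) if layer is not None]
    for layer in present:
        if layer.explicit_deny:  # an explicit deny anywhere wins outright
            return "Deny", f"explicit deny in {layer.name}"
    if not scp.allows:
        return "Deny", "not allowed by SCP"
    if boundary is not None and not boundary.allows:
        return "Deny", "not allowed by permission boundary"
    if not identity.allows:
        return "Deny", "no identity-policy allow"
    return "Allow", "identity policy"


if __name__ == "__main__":
    print(evaluate(Layer("SCP", allows=True, explicit_deny=True), Layer("identity policy", allows=True)))
```

For a real call, IAM policy simulation or Access Analyzer gives the authoritative answer; the sketch is only a mental model for reading the table.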

The decision table — given the breadth and the trail, which layer is the culprit:

If you see…	And CloudTrail shows…	It’s probably…	Do this
Many services, many accounts, one minute	`AttachPolicy`/`UpdatePolicy` in mgmt account	An SCP edit	Detach/correct the SCP in the management account
Everything that decrypts fails	`PutKeyPolicy`/`RevokeGrant`	A KMS key policy/grant change	Restore the `Decrypt`/`GenerateDataKey` statement
One bucket denies all writers	`PutBucketPolicy`	A resource policy edit	Revert the bucket policy statement
One role suddenly limited	`PutRolePermissionsBoundary`	A tightened boundary	Loosen/correct the boundary narrowly
One service, one account	`PutRolePolicy`/`DeleteRolePolicy`	An identity policy edit	Revert the role policy
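The decision table can also be applied mechanically to a batch of CloudTrail events. The sketch below assumes events shaped like `lookup-events` output (an `EventName` and an `EventTime`), keeps only policy edits in a window before the onset, and lists the widest-reaching layer first. The mapping, the ten-minute window and the function name are illustrative assumptions, and the mapping covers only a handful of event names.

```python
from datetime import datetime, timedelta, timezone

# (reach rank, layer) per CloudTrail event name; rank 0 has the widest reach.
EVENT_LAYER = {
    "AttachPolicy": (0, "SCP (Organizations)"),
    "UpdatePolicy": (0, "SCP (Organizations)"),
    "DetachPolicy": (0, "SCP (Organizations)"),
    "PutBucketPolicy": (1, "Resource policy"),
    "PutKeyPolicy": (2, "KMS key policy / grant"),
    "RevokeGrant": (2, "KMS key policy / grant"),
    "PutRolePermissionsBoundary": (3, "Permission boundary"),
    "PutRolePolicy": (4, "Identity policy (IAM)"),
    "DeleteRolePolicy": (4, "Identity policy (IAM)"),
}


def suspects(events, onset, window=timedelta(minutes=10)):
    """Return (layer, event name, time) for policy edits in the window before onset,
    widest-reaching layer first."""
    hits = []
    for event in events:
        ranked = EVENT_LAYER.get(event["EventName"])
        if ranked and onset - window <= event["EventTime"] <= onset:
            hits.append((ranked[0], event["EventTime"], ranked[1], event["EventName"]))
    return [(layer, name, when) for _, when, layer, name in sorted(hits)]


if __name__ == "__main__":
    onset = datetime(2026, 6, 15, 14, 30, tzinfo=timezone.utc)
    sample = [{"EventName": "AttachPolicy", "EventTime": onset - timedelta(minutes=1)}]
    print(suspects(sample, onset))
```

Remember that the events for different layers live in different accounts and Regions, so the input list usually has to be assembled from more than one lookup.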

Fix.

Mitigate now: revert the offending policy change — this is the fastest mitigation and CloudTrail tells you precisely what changed and to what. If it was an SCP, detach or correct it in the management account; if a KMS key policy, restore the statement that granted the failing principals kms:Decrypt/kms:GenerateDataKey.
Fix the cause: re-apply the intended restriction correctly and narrowly — most blast-radius incidents come from an overly broad deny or a removed-too-much edit. Scope conditions tightly; never broaden a deny without modelling who it hits.
Prevent: manage IAM/SCPs/key policies as code through a pipeline with review and a plan/diff, never click-ops in the console; test SCP changes against a non-production OU first; use IAM Access Analyzer and policy simulation pre-merge; alarm on CloudTrail for sensitive events (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy) so a risky change is visible immediately.

The CloudTrail lookup that finds the offending edit, by event name:

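# Find an SCP attachment in the inflection window. Run it in the management account; Organizations
# events are recorded in us-east-1. lookup-events takes one attribute per call, so repeat the
# command per event name (UpdatePolicy, PutKeyPolicy, PutRolePolicy, ...) in the relevant account and Region.
aws cloudtrail lookup-events --region us-east-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AttachPolicy \
  --start-time 2026-06-15T14:25:00Z --end-time 2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table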

The lesson: a one-line policy edit can have the widest blast radius in AWS. The layered evaluation model (SCP → resource policy → permission boundary → identity policy, with explicit deny trumping all) is exactly what you walk during diagnosis — and exactly why these changes belong in a reviewed pipeline. The governance depth is in Organizations, SCP Guardrails & Delegated Admin and KMS Encryption Deep Dive: Keys, Policies, Envelope & Rotation.

Worked scenario 5 — A Route 53 / DNS or certificate failure

Symptom. Users report “the site won’t load” or “your connection is not private,” but your load balancer and application metrics look completely healthy — low latency, no 5xx, normal CPU. The failure is happening before traffic reaches your infrastructure, which is why your dashboards are clean.

Hypotheses.

DNS resolution is failing or returning the wrong answer (a record was changed/deleted, a failover/health-check flipped, an NS delegation issue).
A TLS certificate expired or the wrong certificate is being served (ACM cert not renewed because validation lapsed; cert/domain mismatch).
CDN/edge or WAF is blocking or misrouting at the edge.

Cross-service diagnosis. Healthy backend metrics with “can’t load” reports is the signature of an edge/DNS/cert problem. Split the two possibilities at the command line. For DNS, resolve the name and inspect the answer end to end:

dig +trace example.com           # follows the delegation chain NS by NS
dig example.com A @8.8.8.8       # what a public resolver actually returns

If dig returns the wrong IP, an empty answer, or SERVFAIL, the problem is DNS. In Route 53, check whether a health check flipped a failover record (a failing health check will route away from the healthy endpoint), whether a record was edited (CloudTrail ChangeResourceRecordSets shows the change and the actor), and whether the hosted zone’s NS records still match the registrar’s delegation. For TLS, inspect the served certificate’s expiry and subject:

echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

An expired notAfter, or a subject/SAN that does not cover the host, is your cause. With ACM, expiry almost always traces back to DNS validation breaking (the CNAME validation record was removed, so ACM could not auto-renew) — and CloudTrail/ACM events will show the renewal failures.
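If you want the expiry as a number rather than a date to eyeball, the `-dates` output can be parsed in a few lines. A minimal sketch, assuming the usual `notAfter=Mon DD HH:MM:SS YYYY GMT` format that `openssl x509 -dates` prints and an English locale for month names; the function name is hypothetical.

```python
from datetime import datetime, timezone


def days_to_expiry(openssl_dates: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's notAfter (negative once expired)."""
    for line in openssl_dates.splitlines():
        if line.startswith("notAfter="):
            stamp = line.split("=", 1)[1].strip().removesuffix(" GMT")
            # %b parses English month abbreviations under the default C locale.
            expires = datetime.strptime(stamp, "%b %d %H:%M:%S %Y").replace(tzinfo=timezone.utc)
            return (expires - now).days
    raise ValueError("no notAfter line found")


if __name__ == "__main__":
    sample = "notBefore=Jun 15 00:00:00 2025 GMT\nnotAfter=Jun 15 00:00:00 2026 GMT"
    print(days_to_expiry(sample, datetime.now(timezone.utc)))
```

Run from outside your VPC on a schedule, the same calculation is the core of an external certificate canary.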

The edge failure modes, the one command that confirms each, and the fix:

Failure mode	Confirm with	Tell-tale	Mitigate	Fix the cause
Record edited/deleted	`dig A @8.8.8.8`; CloudTrail `ChangeResourceRecordSets`	Wrong/empty answer	Revert the record change	Manage records as reviewed code
Failover health-check flipped	Route 53 health-check status	Traffic routed to the secondary/none	Correct the health-check config	Tune thresholds; right endpoint
NS delegation mismatch	`dig +trace`; compare registrar NS	Delegation chain breaks	Fix registrar NS to match zone	Document delegation; alarm on drift
Cert expired	`openssl s_client` `notAfter`	`ERR_CERT_DATE_INVALID`	Deploy a valid cert now	Let ACM auto-renew; keep CNAME
Wrong cert / SAN mismatch	`openssl s_client` `-subject`	Host not in subject/SAN	Attach the correct cert	Issue/replace cert covering the host
ACM auto-renew failed	ACM console; CloudTrail renewal events	Validation CNAME missing	Re-add the validation CNAME	Keep the DNS validation record intact

Fix.

Mitigate now: for DNS, revert the bad ChangeResourceRecordSets or correct the failover/health-check configuration so traffic routes to the healthy endpoint; for an expired cert, deploy a valid certificate to the load balancer/CloudFront immediately. Remember DNS TTLs mean changes are not instant — lower the TTL if you anticipate needing fast cutovers.
Fix the cause: restore the ACM DNS validation CNAME so auto-renewal works (this is the single most common ACM expiry cause); fix the NS delegation if the registrar and hosted zone disagree; correct health-check thresholds that are too sensitive.
Prevent: let ACM manage and auto-renew certificates with DNS validation kept intact (no manual certs to forget); alarm on DaysToExpiry for any certificate; manage Route 53 records as code with review; add an external synthetic check (a canary that resolves DNS and validates the cert from outside your VPC) so you detect an edge failure your internal dashboards cannot see.

The lesson: when your metrics are healthy but users cannot reach you, look outward — DNS, certificate, edge — because the failure is upstream of everything your CloudWatch sees. The routing and health-check depth is in Route 53: DNS Records, Routing Policies & Health Checks.

Service Quotas, Trusted Advisor and the Health Dashboard as prevention

Three of the five scenarios above (and most real incidents) are preventable with hygiene that lives outside the application:

Service Quotas is the source of truth for account limits. Most quotas now publish a usage CloudWatch metric — alarm at ~80% of every quota that scales with traffic (concurrent Lambda executions, ENIs/Region, EIPs, API request rates, RDS instances). Request increases ahead of known peaks.
Trusted Advisor runs periodic checks across cost, performance, security, fault tolerance and service limits. Its limit and fault-tolerance checks are a cheap pre-incident sweep; its security checks catch the open-bucket and over-broad-policy classes before they bite.
AWS Health Dashboard — the Personal Health Dashboard (PHD) shows events specific to your account and resources (scheduled maintenance, AZ events, deprecations) and is your first stop to answer “is this AWS or me?” The Service Health Dashboard is the public, all-customers view. PHD can fire EventBridge events so AWS-side issues page you automatically.

The three prevention tools, what they catch, their cadence, and how to wire an alarm:

Tool	Catches	Cadence	How to alarm / automate	Cost
Service Quotas (usage metrics)	Approaching a limit that scales with traffic	Near-real-time metric	CloudWatch alarm on the `AWS/Usage` metric at 80% of the quota	Free
Trusted Advisor (limit + FT checks)	Service-limit headroom, single-AZ risk, open buckets	Periodic (refresh)	EventBridge on check status; weekly review	Business/Enterprise Support for full set
AWS Health (PHD)	AWS-side events for your resources/AZs	Event-driven	EventBridge rule → SNS/page	Free
CloudWatch composite alarm	Multi-signal SLO breach (burn rate)	Real-time	Composite of golden-signal alarms	Per-alarm pricing

The pattern across all three: convert future incidents into present-day alarms. A quota you alarm on at 80% is a Tuesday-afternoon ticket; the same quota discovered at 100% during a launch is a Sev-1.

Closing the loop: the blameless COE

Prevention is only real once it is written down and tracked. Amazon’s Correction of Error (COE) is a blameless postmortem with a fixed shape — and the discipline is that every contributing factor produces a tracked, owned, dated CAPA action item. The sections, what each captures, and the trap if you skip it:

COE section	What it captures	Done well	Trap if skipped
Summary	One paragraph: what broke, who was impacted	Plain, customer-framed	Reads as internal jargon nobody acts on
Impact	Duration, scope, requests/revenue affected	Quantified, not “some users”	Severity gets re-litigated later
Timeline	Detection → mitigation → resolution on one UTC clock	Minute-by-minute with actors	“It was a blur” — no learning
Five-whys	Trigger → … → systemic root cause	Reaches a missing alarm/timeout/review	Stops at the trigger; it recurs
Contributing factors	Everything that made it worse or slower	Honest, blameless	Single-cause myth hides real gaps
CAPA action items	Owned, dated corrective + preventive work	One alarm/guardrail per factor	A wish list nobody completes
Lessons learned	What the team now knows	Shared org-wide	Knowledge stays in one head

The five-whys is where juniors stop too early. “DynamoDB throttled” is a trigger, not a root cause; keep asking until you reach the systemic gap — why was there no throttle alarm, no backoff, no load test on access-pattern changes? — because that gap is what the CAPA items must close.

Architecture at a glance

The diagram traces a real request as it actually flows and maps each of the five complex-incident classes onto the exact hop where it bites. Read it left to right. A request enters at the edge: Route 53 resolves the name and CloudFront terminates TLS with an ACM certificate — this is where DNS/cert failures (badge 1) strike, and the cruel part is that everything downstream stays green, so your dashboards lie. It passes into ingress, an ALB spread across three AZs with WAF in front — this is where a single-AZ impairment (badge 2) shows up as one-third of requests failing while the per-AZ HealthyHostCount drops in exactly one zone. It reaches compute: the checkout service calls the order service (ECS/EKS pools and Lambda concurrency), where a quota or concurrency ceiling (badge 3) throttles in proportion to load. The order service then hits the data tier — a DynamoDB table whose hot partition throttles and cascades back upstream (badge 4) when a flag flips Query to Scan, plus an RDS Multi-AZ instance whose primary may sit in the impaired zone.

Above and beneath the path runs the control & correlation plane — CloudWatch (metrics + Logs Insights), CloudTrail (who changed what), and the X-Ray service map (the fault edge) — all read against one UTC clock. That plane is also where an IAM/SCP blast-radius change (badge 5) is diagnosed: a one-line policy edit denies many services at one minute, and only CloudTrail in the management account shows the edit. The whole method is in the picture: localise the symptom to a hop, read the golden-signal shape, run the named tool on one clock, and walk to the cause even when it lives a hop or a policy-layer away.

Real-world scenario

Northwind Commerce runs an e-commerce platform on AWS in ap-south-1 (Mumbai): CloudFront → ALB (three AZs) → an ECS Fargate checkout service → a Lambda order service → a DynamoDB orders table, with RDS Multi-AZ (PostgreSQL) for the ledger and an SQS queue for async fulfilment. Traffic averages 600 requests/second with a 7pm festival-sale spike to ~2,400 rps. The platform team is five engineers across two squads; the monthly AWS spend is about ₹9,80,000.

The incident began on a Saturday festival sale. At 19:06 the SLO burn-rate alarm fired: checkout error rate crossing 5% and climbing. The first responder’s reflex was to look at the checkout service — CPU normal, no deploy, recent logs unremarkable. Meanwhile the on-call for the order squad swore they had not deployed either. Two squads, two dashboards, and for the first eight minutes no Incident Commander — three theories, no progress, exactly the failure the lifecycle exists to prevent.

The turn came when a senior engineer took the IC role, put every console in UTC, and ran the golden-signal sweep on checkout and its dependencies. The shape was unambiguous: checkout p99 rose first (200 ms → 6 s), then errors followed, with traffic only modestly up. That is the signature of a slow or rejecting dependency, not load. The X-Ray service map showed the order-service → DynamoDB edge glowing red with a fault rate. Overlaying the table’s ReadThrottleEvents in CloudWatch showed the metric going non-zero at 19:04, two minutes before the alarm. Now the real question: why throttle on near-flat traffic? A CloudTrail lookup-events for the preceding hour surfaced an UpdateFunctionConfiguration on a recommendations Lambda at 18:58, and a Logs Insights query on order-service showed it had started doing a Scan instead of a Query: a shared library upgrade in that deploy had flipped a feature flag that changed the access pattern. The Scan hammered one partition, DynamoDB throttled, order-service retried without backoff (amplifying the load), its connection pool filled, and checkout, blocked waiting on order-service, backed up and finally 5xx’d. The symptom was in checkout; the trigger was a deploy on a third service; the bottleneck was one DynamoDB partition.

Mitigation, in order: the IC had the recommendations squad revert the flag (removing the Scan instantly), flipped the orders table to on-demand to absorb the residual burst, and confirmed clients now had backoff with jitter via a config push. Error rate fell below 1% within four minutes of the flag revert. They did not scale the checkout service: that would have fixed nothing and cost money, the classic reflex the team had been burned by before.

The durable fix landed the following week: restore the Query access pattern against the right GSI; add a circuit breaker in checkout so a slow order-service fails fast instead of consuming the whole pool; bound retries and set a 2-second client timeout; and add bulkheads so the order dependency cannot exhaust checkout’s shared pool. On prevention they wired three alarms — ReadThrottleEvents > 0, connection-pool saturation, and an end-to-end checkout canary — and added a CI gate requiring a load test for any access-pattern change. The next festival sale ran at 2,500 rps with zero DynamoDB throttles, checkout p95 held at 180 ms, and the COE’s headline line went on the wall: “The failing service is rarely the broken one — read the shape, follow the edge, find the flag.”

The incident as a timeline, because the order of moves is the lesson:

Time (UTC offset)	Symptom	Action taken	Effect	What it should have been
18:58	(none yet)	recommendations deploy flips a flag	Scan begins on order-service	Load-test access-pattern changes in CI
19:04	`ReadThrottleEvents` > 0	(no alarm — gap)	Throttle starts; latency builds	An alarm here would pre-empt the page
19:06	Checkout error > 5%	Burn-rate alarm fires; stare at checkout	8 min lost, no IC	Name an IC immediately
19:14	Still climbing	Senior takes IC, sweeps golden signals in UTC	Shape = slow dependency	The breakthrough
19:17	Edge identified	X-Ray red edge + `ReadThrottleEvents` overlay	Bottleneck = one DDB partition	—
19:20	Trigger found	CloudTrail shows 18:58 deploy; Logs show Scan	Cause = flipped flag	—
19:24	Mitigated	Revert flag; table → on-demand; backoff push	Errors < 1%	Correct night-of mitigation
+1 week	Fixed	Query+GSI, circuit breaker, bulkheads, 3 alarms, canary	0 throttles at 2,500 rps	The actual fix is code + alarms

Advantages and disadvantages

A disciplined, correlation-first incident method has clear strengths and real costs. Weigh it honestly before you mandate it:

Advantages (why this method pays off)	Disadvantages (why it has a cost)
Finds the cause even when it’s a hop or a policy-layer away — the symptom service is rarely the broken one	Requires instrumentation before the incident: structured logs with trace IDs, X-Ray on the critical path, quota alarms
The golden-signal shape classifies the failure in two minutes, before any log-reading	Needs practised judgement — reading shapes well is a skill, not a checklist
One UTC clock across four tools removes the “wrong time zone” class of false conclusions	Discipline slips under pressure; it takes drills (game days) to make it muscle memory
Mitigate-then-diagnose minimises customer impact; the telemetry keeps the evidence	The instinct to “understand first” is strong; juniors resist mitigating blind
The IC role prevents three-theory thrash and protects responders	A dedicated IC is staffing overhead small teams feel acutely
Blameless COE + CAPA closes the loop so incidents don’t recur	Postmortems take time and only pay off if CAPA items are tracked to done
Quota/AZ/cert hygiene converts whole incident classes into Tuesday tickets	Alarms and canaries cost a little to run and must be maintained (alarm fatigue is real)

The method is right for any team running a real distributed system where minutes of customer impact are expensive and incidents span services and teams. It is overkill for a single static site or a one-service hobby project — there, a stack trace is enough. The disadvantages are all front-loaded investment: instrument, alarm, drill. Pay it before the incident, or pay double during one.

Hands-on lab — correlate a self-inflicted incident (Free Tier)

You will create a tiny, safe incident and practise the correlation workflow end to end, then clean up. Everything here is Free-Tier-eligible if you tear it down promptly.

1. Set up a function and a log group. Create a minimal Lambda (any runtime) named coe-lab that logs a structured line and occasionally “fails”:

aws lambda create-function --function-name coe-lab \
  --runtime python3.13 --handler index.handler --timeout 5 \
  --role arn:aws:iam::111122223333:role/your-lambda-basic-role \
  --zip-file fileb://function.zip

(function.zip contains an index.py whose handler prints {"level":"INFO","reqId":context.aws_request_id,...} and raises an exception when the event has {"fail": true}.)
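A minimal `index.py` matching that description might look like the sketch below. The `msg` field and the return value are assumptions added for illustration; only the `level`, `reqId` and the fail-on-demand behaviour come from the description above.

```python
import json


def handler(event, context):
    """Log one structured line, then fail on demand so the lab has errors to find."""
    failing = bool(event.get("fail"))
    # Compact separators keep the line in the {"level":"INFO",...} shape with no spaces.
    print(json.dumps(
        {"level": "INFO", "reqId": context.aws_request_id, "msg": "handling request"},
        separators=(",", ":"),
    ))
    if failing:
        # The unhandled exception makes the Lambda runtime log an [ERROR] line with a traceback.
        raise RuntimeError("simulated failure requested by event")
    return {"ok": True}
```

Zip the file as `function.zip` with `index.py` at the archive root so the `index.handler` setting in the create-function call resolves.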

2. Generate a signal. Invoke it a few times, including failures, to put both success and error lines into CloudWatch Logs:

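# AWS CLI v2 reads --payload as base64 unless told otherwise; drop the extra flag on CLI v1
for i in $(seq 1 20); do
  aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out \
    --payload '{"fail":false}' /dev/null >/dev/null
done
aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out \
  --payload '{"fail":true}' /dev/null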

3. Triage with metrics. In the CloudWatch console, open the Errors, Invocations and Throttles metrics for coe-lab (set the timezone to UTC). Note the minute the error appears — that is your inflection point.

4. Drill in with Logs Insights. Run a query against the function’s log group:

fields @timestamp, @message
| filter @message like /ERROR|Traceback/
| stats count(*) as errors by bin(1m) as minute
| sort minute desc

Confirm the error count and timing match the metric. This is the metric→logs pivot you will do in every real incident.
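To make the `bin(1m)` grouping concrete, here is a rough local equivalent in Python. It assumes you have exported log events as (epoch-milliseconds, message) pairs; the function name and input shape are hypothetical, and only the bucketing logic mirrors the query.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Same idea as the query's filter: keep only lines that look like errors.
ERROR_PATTERN = re.compile(r"ERROR|Traceback")


def errors_per_minute(events):
    """Count error-looking log lines per UTC minute.

    events: iterable of (timestamp_ms, message) pairs.
    Returns a dict mapping 'YYYY-MM-DDTHH:MMZ' to a count.
    """
    counts = Counter()
    for timestamp_ms, message in events:
        if not ERROR_PATTERN.search(message):
            continue
        moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        # Truncating to the minute is what bin(1m) does.
        counts[moment.strftime("%Y-%m-%dT%H:%MZ")] += 1
    return dict(counts)
```

If the per-minute counts here disagree with the Errors metric, check the clock first: both must be read in UTC.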

5. Tie it to a change with CloudTrail. Update the function’s configuration (a harmless change) to create a control-plane event:

aws lambda update-function-configuration --function-name coe-lab --timeout 6
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=coe-lab \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table

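You should see your UpdateFunctionConfiguration event with its timestamp and actor, which is exactly how you would prove “someone changed this at 14:31”. Lambda records the event name with an API-version suffix (UpdateFunctionConfiguration20150331v2), and CloudTrail can take a few minutes to surface a new event, so rerun the lookup if the table is empty at first.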
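The "same window" test can be sketched in a few lines of Python. This assumes the events were saved as JSON using the Time/User/Event projection from the command above, with Time as an ISO 8601 string carrying an explicit UTC offset; the function name and the default window sizes are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def changes_near(inflection, events, before_minutes=15, after_minutes=2):
    """Return control-plane events close to a metric inflection, oldest first.

    inflection: timezone-aware datetime of the inflection minute (UTC).
    events: list of dicts with 'Time' (ISO 8601 with offset), 'User', 'Event'.
    """
    start = inflection - timedelta(minutes=before_minutes)
    end = inflection + timedelta(minutes=after_minutes)
    hits = []
    for event in events:
        when = datetime.fromisoformat(event["Time"])
        if start <= when <= end:
            hits.append((when, event))
    # Sort on the parsed time, not the string, so mixed offsets still order correctly.
    hits.sort(key=lambda pair: pair[0])
    return [event for _, event in hits]
```

The window is deliberately lopsided: a change usually precedes the symptom, so look further back than forward.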

6. Read a quota’s applied value. Practise the scenario-3 move — see how much headroom you have on Lambda concurrency:

aws service-quotas get-service-quota \
  --service-code lambda --quota-code L-B99A9384 \
  --query 'Quota.{Name:QuotaName,Applied:Value}' --output table
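Once you have the applied value, the headroom check is simple arithmetic. The sketch below assumes you read current usage separately (for Lambda, for example, from a concurrency metric); the function name, the returned keys and the 80% default are illustrative.

```python
def quota_headroom(usage, applied, alarm_ratio=0.8):
    """Summarise how close current usage is to an applied quota value."""
    if applied <= 0:
        raise ValueError("applied quota must be positive")
    used_ratio = usage / applied
    return {
        "used_ratio": used_ratio,
        "remaining": applied - usage,
        # True once usage reaches the alarm threshold (80% by default).
        "alarm": used_ratio >= alarm_ratio,
    }
```

With a hypothetical applied value of 1000 and a peak usage of 850, `quota_headroom(850, 1000)` reports 150 remaining and sets the alarm flag.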

Validation. You have, on one UTC timeline, (a) a metric inflection, (b) the matching log lines, (c) the control-plane change that explains a config-driven incident, and (d) a read of the quota that scenario 3 would breach. That is the core RCA loop in miniature.

Each lab step mapped to the real incident move it rehearses:

Lab step	Real-incident move	Tool
3 — note the inflection minute (UTC)	Golden-signal sweep	CloudWatch metrics
4 — Logs Insights error count by minute	Metric → logs pivot	Logs Insights
5 — CloudTrail `lookup-events`	Tie a step change to a change	CloudTrail
6 — read the applied quota	Quota headroom check (scenario 3)	Service Quotas

Cleanup.

aws lambda delete-function --function-name coe-lab
aws logs delete-log-group --log-group-name /aws/lambda/coe-lab

Cost note. A handful of Lambda invocations and a few Logs Insights queries fall within the Free Tier; the log group stores a trivial amount. CloudTrail management events are free to view via lookup-events. Deleting the function and log group leaves nothing billable. If you ever enable CloudTrail data events or create a CloudWatch alarm at scale, those can incur small charges — not needed for this lab.

Common mistakes & troubleshooting

This is the differentiator. Cross-service RCA fails for process reasons as often as technical ones — and the technical traps recur. The playbook below is the table to keep open at 02:14: match the symptom of your investigation or the incident, read the cause, run the exact confirm, apply the fix.

#	Symptom (of your process or the incident)	Root cause	Confirm (exact command / path)	Fix
1	Investigation goes in circles; the “fix” doesn’t hold	Comparing signals in different time zones	Check each console/query header — is it local or UTC?	Put every console and query in UTC; one clock, one window
2	You debug 40 min while customers suffer	Trying to find root cause before mitigating	Is customer impact still live and unmitigated?	Mitigate first (roll back / fail over / raise quota), diagnose after
3	Three people, three theories, no progress	No Incident Commander	Is anyone coordinating vs everyone debugging?	Name an IC who decides and coordinates and does not debug
4	You “fixed” the trigger but it recurs	Treated the trigger as the cause	Did the five-whys reach a systemic gap (no alarm/timeout)?	Five-whys to the systemic cause; add the missing guardrail
5	The one trace you need isn’t in X-Ray	Default sampling dropped it	X-Ray sampling rules show the default reservoir	Raise sampling for the affected service; lean on metrics+logs meanwhile
6	CloudTrail “shows nothing” around the event	Wrong account, or only management events	Are you in the mgmt/delegated-admin account? Window wide enough?	Check the org trail; widen for delivery latency; enable data events if needed
7	You blame the failing service	Symptom and cause are in different services	Does the X-Ray map show the fault on a downstream edge?	Follow the service map edge; read golden-signal shapes
8	You scaled up and it “worked,” then returned worse	Masked a resource ceiling (quota/throttle/saturation)	Do errors track traffic (vanish at rest)?	Find the ceiling (Service Quotas / throttle metric); fix, don’t mask
9	502s blamed on the app, but app logs are clean	The load balancer emitted the 502 on timeout	X-Ray shows the request succeeding slowly; client got 502	Speed up the target; align LB/integration timeout
10	“One-third fail, two-thirds fine” chased as a bad host	Single-AZ impairment	ALB `HealthyHostCount` by AZ-ID; PHD event	Fail away from the AZ; rely on cross-AZ N+1 headroom
11	Many services `AccessDenied`, you edit each IAM policy	The deny is a layer above IAM (SCP/KMS)	CloudTrail `AttachPolicy` (mgmt account) / `PutKeyPolicy` (key-owning account)	Revert the SCP/key-policy edit; explicit deny wins
12	Dashboards green but users can’t load; you check the app	Failure is upstream (DNS/cert/edge)	`dig +trace`; `openssl s_client … -dates`	Restore the record / ACM validation CNAME / valid cert
13	Postmortem names a person	Blameful culture	Does the COE ask “who” or “what about the system”?	Run a blameless COE; fix the system that let it happen
14	Quota discovered at 100% during a launch	No quota monitoring	Is there an alarm at 80% of this quota?	Alarm on the Service Quotas usage metric at ~80%
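Row 8's confirm step, "do errors track traffic?", can be turned into a rough numeric check. The sketch below computes a Pearson correlation over two per-minute series you have already pulled from CloudWatch; the function names and the 0.8 threshold are assumptions chosen for illustration, not an AWS-defined rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def errors_track_traffic(traffic, errors, threshold=0.8):
    """Heuristic: a strong positive correlation hints at a ceiling, not a bug."""
    return pearson(traffic, errors) >= threshold
```

A True result is a prompt to go and find the quota or throttle metric, not proof of one; a steady error rate that ignores load points back towards a code or config fault.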

Best practices

Impose the lifecycle every time, even for small incidents — the muscle memory pays off when it is a Sev-1 at 03:00.
One Incident Commander; comms on a cadence. Separate coordinating, communicating and debugging into different people.
Mitigate then diagnose. Optimise for time-to-recovery; the telemetry keeps the evidence for the RCA.
Everything on one UTC timeline. Metrics, logs, traces and CloudTrail aligned to the same clock and window.
Read the golden-signal shape first. Latency/traffic/errors/saturation movement classifies the failure before any log line.
Instrument for correlation before you need it: structured logs with request and trace IDs, X-Ray on the critical path, alarms on the four golden signals and on quotas at 80%.
Build for failure: multi-AZ with headroom, timeouts + backoff-with-jitter + circuit breakers + bulkheads, tested Multi-AZ failover, queues in front of spiky producers.
Govern change: IAM/SCP/DNS/cert as reviewed code; alarm on sensitive CloudTrail events.
Never scale to mask a ceiling. If errors track traffic, find the quota/throttle — scaling up just delays and hides it.
Close the loop with a blameless COE and track CAPA items to done — an action item without an owner and a date is a wish.
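As one concrete instance of the "backoff-with-jitter" practice above, here is a minimal sketch of capped exponential backoff with full jitter. The function names, the default base and cap, and the choice to retry on any exception are all assumptions made for brevity; production code would normally retry only on throttling or transient errors.

```python
import random
import time


def full_jitter_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Delay before retry `attempt` (0-based): random in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying failures with capped, jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error instead of hiding it.
            if attempt == attempts - 1:
                raise
            sleep(full_jitter_delay(attempt, base, cap, rng))
```

The jitter is the part that matters during an incident: without it, every client retries on the same schedule and the retries themselves become the next traffic spike.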

Security notes

Treat a sudden, broad AccessDenied as potentially a security event, not just an outage — it can equally be an attacker’s tightened policy, a compromised credential changing permissions, or a legitimate-but-wrong edit. CloudTrail (including the Organizations trail) is your forensic record; protect it with log-file validation and a separate, locked-down logging account so an attacker cannot cover their tracks.
During an incident, mitigations sometimes tempt you to loosen security (open a security group “just to test”, attach an over-broad policy). Resist, or scope it tightly and time-box it — incident-time exceptions are how durable holes get created. Record any exception in the COE and revert it before you close the incident.
Least-privilege the responders too: break-glass roles should be auditable, MFA-protected and alerted-on, not shared admin keys.
For DNS/cert incidents, remember that a hijacked Route 53 record or a mis-issued certificate is a security incident — verify who changed the record (CloudTrail ChangeResourceRecordSets) and protect hosted zones and ACM with tight IAM and change review.
Keep CloudTrail data events in mind: they are off by default and cost extra, but for sensitive buckets/tables they are the difference between knowing and guessing who read what during a suspected breach.

The security-relevant control-plane events to alarm on, and why each matters during an incident:

Event	Service	Why alarm on it	During an incident it tells you
`AttachPolicy` / `UpdatePolicy`	Organizations (SCP)	Widest blast radius in AWS	An org-wide deny was just introduced
`PutKeyPolicy` / `RevokeGrant`	KMS	Breaks everything that decrypts	Why many services lost `Decrypt`
`PutRolePolicy` / `DeleteRolePolicy`	IAM	Per-service permission change	A specific role gained/lost access
`ChangeResourceRecordSets`	Route 53	DNS hijack / outage vector	Who edited the record and to what
`AuthorizeSecurityGroupIngress`	EC2	An incident-time “just to test” hole	A security group was opened up
`StopLogging` / `DeleteTrail`	CloudTrail	An attacker covering tracks	Your evidence source was tampered with
`PutBucketPolicy` / `PutBucketAcl`	S3	Can expose a bucket publicly	A data-exposure change just landed
`ConsoleLogin` (failure / new region)	AWS Sign-In (CloudTrail)	Credential misuse signal	Who logged in, from where, success or not

Cost & sizing

The incident method itself is nearly free — the cost is in the instrumentation that makes correlation possible, and the trade-off is “spend a little continuously to avoid spending a lot during an outage.” What drives the observability bill, and how to right-size it:

Cost driver	What it is	Rough figure	Right-size by
CloudWatch custom metrics	Per-metric monthly charge	~$0.30/metric/mo (first tier)	Emit only the golden signals + key dependencies
CloudWatch Logs ingestion	Per-GB ingested	~$0.50–0.57/GB (Region-dependent)	Sample/structure logs; drop debug in prod
Logs Insights queries	Per-GB scanned	~$0.005/GB scanned	Narrow the time window; filter early
X-Ray traces	Per-trace recorded/retrieved	~$5 per 1M traces recorded	Sample the critical path; raise only during incidents
CloudTrail management events	First copy of mgmt events	Free	Always on; it’s your evidence
CloudTrail data events	S3/DynamoDB object-level	~$0.10 per 100K events	Enable only on sensitive buckets/tables
CloudWatch alarms	Per-alarm monthly	~$0.10/alarm/mo (standard)	Alarm on signals + quotas, not everything
Synthetics canaries	Per canary run	~$0.0012/run	One end-to-end canary per critical journey

A sizing rule of thumb: the four-golden-signal alarms plus a quota alarm at 80% plus one end-to-end canary per critical user journey is a few hundred rupees a month for most stacks — and it is the difference between a Sev-1 and a Tuesday ticket. The expensive mistake is the opposite: no instrumentation, then a multi-hour outage whose revenue cost dwarfs a year of observability spend. Free Tier covers the lab here entirely; production observability scales with traffic, but the golden-signal subset keeps it bounded.
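The rule of thumb can be checked with the rough per-unit figures from the table above. This is back-of-envelope arithmetic only: the canary interval is an assumption, the prices are the table's approximations, and the function name is hypothetical.

```python
# Rough unit prices taken from the cost table above (USD).
ALARM_USD_PER_MONTH = 0.10
CANARY_USD_PER_RUN = 0.0012


def monthly_baseline_usd(alarms, canaries, canary_interval_minutes, days=30):
    """Estimate the monthly cost of standard alarms plus scheduled canaries."""
    runs_per_canary = (days * 24 * 60) // canary_interval_minutes
    canary_cost = canaries * runs_per_canary * CANARY_USD_PER_RUN
    return alarms * ALARM_USD_PER_MONTH + canary_cost


# Four golden-signal alarms + one quota alarm + one canary every 15 minutes.
print(f"{monthly_baseline_usd(5, 1, 15):.2f}")
```

Under those assumptions the baseline comes to about $3.96 a month. The canary interval dominates: the same single canary run every minute would cost roughly $52.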

The cost-of-an-incident framing, which is what justifies the spend:

If you skip…	The incident it enables	Rough cost of that incident
A quota alarm at 80%	A launch-time Sev-1 throttle	Revenue per minute × outage minutes
Per-AZ healthy-host alarms	A one-AZ impairment chased blind	Extended MTTR; possible full outage if N+1 absent
A `DaysToExpiry` cert alarm	A site-wide cert expiry	Total outage until a cert is reissued
An external DNS/cert canary	An edge failure your dashboards can’t see	Customer-reported outage; reputational hit
X-Ray on the critical path	A cascade you can’t localise	Hours of MTTR finding the fault edge
A blameless COE + tracked CAPA	The same incident next quarter	Repeat outage at full cost — the worst spend of all

Interview & exam questions

1. Walk me through how you run a multi-service incident. Detect → triage (scope, severity, name an IC and comms lead) → communicate on a cadence → mitigate before fully diagnosing → RCA by correlating metrics/logs/traces/CloudTrail on one UTC timeline → prevent via a blameless COE with tracked CAPA items.

2. The symptom is in service A but A didn’t change — how do you find the cause? Use the X-Ray service map to find the failing edge, read the four golden signals for A and its dependencies to classify the failure shape, then pivot to the dependency’s metrics/logs and to CloudTrail for any change at the inflection minute. The cause is usually a downstream throttle, a saturated shared resource, or a config change elsewhere.

3. Errors rise in proportion to traffic and vanish when load drops — what is it, usually? A quota or limit (Lambda concurrency, API request rate, ENIs, etc.), not a bug. Identify it via the throttle metrics and the Service Quotas console; mitigate with a quota increase / load shedding / queueing; prevent by alarming at 80% of quota.

4. One-third of requests fail, two-thirds are fine, persistently: what’s your first hypothesis? Single-AZ impairment on a three-AZ deployment. Confirm with the Personal Health Dashboard, per-AZ load-balancer/target metrics (by AZ ID, since AZ names map to different physical zones in each account), and the data tier (is an RDS or cache primary in that AZ?). Fail traffic away and rely on cross-AZ headroom.

5. Difference between a mitigation and a fix — give an example. A mitigation stops customer pain now (roll back the deploy, fail over the AZ, raise the quota); a fix removes the cause (correct the access pattern, fix the policy, restore ACM validation). You mitigate first, then fix; both belong in the COE.

6. Multiple services across accounts throw AccessDenied at the same minute — what changed? Almost certainly a policy layer above the app: an SCP (in the management/delegated-admin account), a KMS key policy, or a shared role/permission boundary. CloudTrail (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy, PutRolePolicy) shows the edit; explicit deny wins, so revert the change to mitigate.

7. Your dashboards are green but users say the site won’t load — where do you look? Outward — DNS and certificates and edge. dig +trace to validate resolution and openssl s_client to check the served cert’s expiry/subject. Common cause: ACM auto-renewal failed because the DNS validation CNAME was removed; or a Route 53 record/health-check flipped.

8. What’s a cascading failure and how do you prevent it? A small dependency hiccup amplified by retries and missing timeouts until threads/connections exhaust and the failure spreads upstream. Prevent with timeouts, exponential backoff with jitter, circuit breakers and bulkheads, plus alarms on dependency throttles and saturation.

9. How do you correlate a metric spike to a specific change? Note the exact UTC minute of the inflection, then aws cloudtrail lookup-events for that window. A control-plane event in the same minute (a deploy, a quota edit, a policy change) is your trigger.

10. Why blameless postmortems? Because blame drives information underground — people stop sharing what really happened, and you fix symptoms instead of the systemic gaps (missing alarm, no timeout, click-ops policy edit) that let a human error become an outage. The COE asks “what about the system allowed this?”, not “who did it?”

11. X-Ray is sampled, so what if the failing trace wasn’t captured? Temporarily raise the sampling rate for the affected service, and in the meantime lean on CloudWatch metrics (which count every request, whereas X-Ray’s service-map statistics are built from the sampled traces) and structured logs keyed by request ID.

12. How do you make quota breaches a non-event? Treat quotas as capacity planning: enable Service Quotas usage metrics and Trusted Advisor limit checks, alarm at ~80% of every quota that scales with traffic, and request increases ahead of known peaks.

The certifications these questions map to:

Question theme	Maps to	Domain
Lifecycle, IC, mitigate-first, COE	SOA-C02	Monitoring, Logging & Remediation
Golden signals, correlation, X-Ray map	SOA-C02 / DOP-C02	Monitoring & Logging
Quotas, AZ design, blast radius	SAP-C02	Continuous improvement / org complexity
Blameless postmortem + CAPA	DOP-C02	Incident & Event Response

Quick check

1. What must you align across CloudWatch, CloudTrail, X-Ray and logs before you can correlate them?
2. In the incident lifecycle, what do you do before you fully understand the cause?
3. “Errors rise with traffic, recover when traffic drops”: what class of problem is this?
4. Which AWS construct, changed in one account, can deny actions across an entire Organization regardless of IAM?
5. Your application metrics are healthy but users can’t reach the site. Name two things to check.

Answers

1. A single UTC timeline (and ideally a shared request/trace ID): same clock and window across every tool.
2. Mitigate (roll back, fail over, raise a quota, shed load) to stop customer impact; diagnose afterwards from the telemetry.
3. A service-quota / limit breach (e.g. Lambda concurrency, API throttling), not an application bug.
4. A Service Control Policy (SCP): an explicit deny in an SCP overrides any IAM allow across the OU/Organization.
5. DNS (dig +trace: wrong/empty answer, failover/health-check flip, deleted record) and the TLS certificate (openssl s_client: expired or wrong subject, usually ACM DNS-validation lapse). Edge/WAF is a valid third.

Glossary

Incident Commander (IC): The single person who coordinates an incident — decides, assigns and protects responders; does not debug.
Mitigation vs fix: A mitigation restores service now (rollback, failover, quota bump); a fix removes the underlying cause.
RCA (Root-Cause Analysis): The disciplined search for the true, systemic cause behind an incident, not merely its trigger.
COE (Correction of Error): Amazon’s blameless postmortem format — timeline, impact, five-whys, contributing factors, and CAPA action items.
CAPA (Corrective and Preventive Action): The tracked, owned, dated action items that prevent recurrence.
Golden signals: Latency, traffic, errors, saturation — the four metrics whose shape classifies a failure fast.
Correlation: Lining up signals from several tools on one UTC clock and window to locate a non-local cause.
Cascading failure: A small dependency problem amplified (by retries/missing timeouts) until it exhausts resources and spreads upstream.
Bulkhead / circuit breaker: Resilience patterns that isolate dependencies and fail fast so one slow downstream cannot take down the whole service.
Blast radius: The breadth of impact of a single change — widest, in AWS, for SCP/IAM/KMS-key-policy edits.
AZ impairment: A fault confined to one Availability Zone; survivable if you deploy across AZs with headroom.
Service Quota: An account/Region limit on a resource or API rate; a capacity-planning input, not an incident if monitored.
Personal Health Dashboard (PHD): The account-specific view of AWS health events affecting your resources; now presented as “Your account health” inside the AWS Health Dashboard.
Logs Insights: CloudWatch’s query language over log groups — your metric→detail drill-down tool.
X-Ray service map: The derived call graph across services, with per-edge latency/error/fault statistics — your “where is it failing” view.

Next steps

You now have the operational mindset for incidents that span services. Turn that operational knowledge into design with The AWS Architecting Ladder: From a Static Site to Multi-Region Active-Active, which shows how the resilience primitives you just used in mitigation — Multi-AZ, failover, headroom, bulkheads — are baked into architectures from the ground up, so the incidents in this lesson become non-events. Deepen the instrument plane with CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing, the two tools you triangulate with most. Revisit the per-layer detail in AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda, and when you are ready to certify, the AWS Certification Prep Kit (CLF, SAA, SOA, DVA, SAP, DOP) maps this lesson to the exact SOA-C02 and SAP-C02 domains it covers.

Written by Vinod
