
Advanced AWS Troubleshooting: Complex Multi-Service Incidents & Root-Cause Analysis

A single-service problem is a puzzle; a multi-service incident is a crime scene. When an EC2 instance will not accept SSH you can reason about it in isolation — security group, key, subnet, status checks — and the previous lesson gave you those playbooks. But production rarely fails that politely. Real incidents arrive as a payments API at 30% error rate, and the cause turns out to be a DynamoDB table that started throttling because a deploy three hours ago doubled its read pattern, which exhausted the connection pool in a downstream Lambda, which tripped a circuit breaker in the upstream service, which is why the checkout page — three hops away — is timing out. Nobody changed checkout. The symptom and the cause are in different services, owned by different teams, and the only thing that ties them together is a trace ID and a timeline.

This lesson is about that kind of incident. The skill is no longer “do I know this service” — it is correlation under pressure: imposing a lifecycle on the chaos, reading three or four observability tools against one clock, forming hypotheses you can cheaply disprove, and writing the whole thing down afterwards so it never recurs. We will walk the incident-response lifecycle, the correlation toolkit (CloudWatch metrics and Logs Insights, CloudTrail, X-Ray, the Health Dashboard, Trusted Advisor), five fully worked complex scenarios as symptom → hypotheses → cross-service diagnosis → fix, the role of Service Quotas, and the blameless postmortem (Amazon’s COE) that closes the loop. This is the SOA-C02 and SAP-C02 operational-excellence mindset, written the way an on-call architect actually works it.

Because this is a reference you will return to mid-incident, the playbook itself, the error codes, the golden-signal shapes, the quota ceilings and the scenario fingerprints are all laid out as scannable tables — read the prose once, then keep the tables open at 02:14. By the end you will stop staring at the failing service. You will localise a symptom to a hop on the request path, classify the failure shape in two minutes, and walk to the cause even when it lives in a service you did not touch.

What problem this solves

In a monolith, the failing function and the failing user are in the same process; a stack trace points at the line. In a distributed system the request fans across CloudFront, an ALB, several compute tiers, a queue, a database and three AWS-managed control planes — and the failure you see is almost never co-located with the failure that caused it. The 5xx surfaces in checkout; the throttle is on one DynamoDB partition; the trigger is a feature flag on a third service. Without a method, three engineers stare at the checkout dashboard while nobody talks to the customer, nobody mitigates, and nobody writes down what actually happened.

What breaks without this discipline: incidents run long because investigation has tunnel vision, mitigations are skipped in a rush to “understand it first,” signals get compared across mismatched clocks so people prove the wrong thing, and the postmortem names a person instead of the systemic gap — so the same outage recurs next quarter. The cost is measured in revenue per minute of customer impact and in the slow erosion of trust when the team cannot say why the last outage happened.

Who hits this: anyone running a real distributed system on AWS, including SREs, on-call engineers, platform teams, and the architects who design the blast radius. It bites hardest on teams without instrumentation for correlation (no request/trace IDs in logs, X-Ray off the critical path, no quota alarms), on multi-account organisations where an SCP edit in the management account can deny actions everywhere, and on cost-sensitive stacks that run in a single AZ or set Lambda reserved concurrency too low. The fix is almost never “add more compute”; it is “impose a lifecycle, read four tools on one clock, and find the hop that is lying.”

To frame the whole field before the deep dive, here is every complex-incident class this lesson covers, the golden-signal shape that fingerprints it, the first tool to open, and the single most common root cause:

| Incident class | Golden-signal shape | First question | First tool to open | Most common single cause |
|---|---|---|---|---|
| Cascading throttle (S1) | Latency up first, then errors, traffic flat | Is a downstream slow or rejecting? | X-Ray service map | A flag flipped Query → Scan, hot-partition throttle |
| AZ impairment (S2) | ~1/3 errors, 2/3 fine, persistent | One AZ or one bad host? | ALB metrics BY AZ + PHD | A single Availability Zone degraded |
| Quota / limit (S3) | Errors track traffic, vanish at rest | Is this a bug or a ceiling? | Service Quotas + Throttle metrics | Concurrency / API-rate / ENI quota hit |
| IAM / SCP blast radius (S4) | AccessDenied across many services, one minute | Which policy layer denies? | CloudTrail (mgmt account) | An SCP/KMS/role edit with a wide reach |
| DNS / cert (S5) | Dashboards green, users can’t load | Is the failure upstream of me? | dig +trace / openssl s_client | ACM renewal lapsed (validation CNAME gone) |

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You should be comfortable with the single-service troubleshooting method and playbooks from AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda — reproduce, isolate the layer, check config against desired state, inspect CloudWatch and CloudTrail, hypothesise, fix, verify, prevent. You should know IAM policy evaluation (explicit deny beats allow beats implicit deny) from IAM Fundamentals: Users, Roles, Policies & Evaluation, how VPC routing and security groups work, and roughly what CloudWatch, CloudTrail and X-Ray each record — the depth is in CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing.

This lesson sits in the Troubleshooting & operations module of the Zero-to-Hero track, immediately after the single-service playbooks and before the architecting ladder. It is mapped to SOA-C02 (AWS Certified SysOps Administrator) and the operational sections of SAP-C02 (Solutions Architect Professional). The assumed knowledge, and where to brush it up if it’s rusty:

| You should know… | Why it matters here | Brush up with |
|---|---|---|
| Single-service troubleshooting method | Cross-service RCA builds on per-layer isolation | Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda |
| IAM policy evaluation (explicit deny wins) | Scenario 4 walks the evaluation chain | IAM Fundamentals: Users, Roles, Policies & Evaluation |
| What CloudWatch / CloudTrail record | They are two of your four correlation tools | CloudWatch & CloudTrail Observability Deep Dive |
| Distributed tracing basics | The X-Ray service map finds the fault edge | X-Ray: Service Map, Segments & ADOT Tracing |
| VPC routing, security groups, AZs | AZ impairment and connectivity reasoning | VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints |

A quick map of who confirms what during a cross-service incident, so you page the right owner fast:

| Layer / plane | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Edge / DNS (Route 53, CloudFront, ACM) | Resolution, TLS, edge routing | Frontend / SRE | DNS misroute, expired cert (S5); dashboards stay green |
| Ingress (ALB/NLB, WAF) | L7 routing, health, AZ spread | Network team | Per-AZ 5xx (S2), WAF 403, listener/cert errors |
| Compute (ECS/EKS, Lambda, EC2) | Your services, pools, concurrency | App / platform team | Cascades (S1), concurrency throttles (S3), crashes |
| Data (DynamoDB, RDS, ElastiCache) | Stateful tier, capacity | Data / platform team | Throttles (S1), AZ-pinned primary (S2), saturation |
| Identity / policy (IAM, SCP, KMS) | Permissions above the app | Security / cloud platform | Wide AccessDenied blast radius (S4) |
| Control / correlation (CloudWatch, CloudTrail, X-Ray) | The evidence plane | SRE / everyone | Not a cause but your instrument; blind spots mislead |

Core concepts

Five mental models make every later diagnosis obvious.

The symptom and the cause live in different services. The defining property of a multi-service incident is non-locality: the 5xx is in checkout, the throttle is on a DynamoDB partition, the trigger is a flag on a third service. So you never trust the failing service as the suspect. You start from the request path, find the failing edge (X-Ray), and walk toward the cause. “Where it hurts” and “what’s wrong” are different questions.

You read four tools against one clock. CloudWatch metrics tell you what changed and when; Logs Insights tells you what happened in a service; CloudTrail tells you who touched the control plane; X-Ray tells you where in the call graph the time and errors go. None of them is sufficient alone, and they only correlate if you put every console and query in UTC with the same time window. The single most common diagnostic error is comparing a log in local time against a metric in UTC and “proving” the wrong thing.
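A minimal Python sketch of putting two timestamps on one clock before comparing them. It assumes ISO-8601 strings; `to_utc` and its `assumed_offset_hours` parameter are illustrative names, and the offset you pass for naive timestamps is whatever the emitting host actually uses.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: str, assumed_offset_hours: float = 0.0) -> datetime:
    """Parse an ISO-8601 timestamp and return it as an aware UTC datetime.

    Timestamps without an offset are assumed to be `assumed_offset_hours`
    away from UTC (an assumption you must supply per log source).
    """
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        local = timezone(timedelta(hours=assumed_offset_hours))
        parsed = parsed.replace(tzinfo=local)
    return parsed.astimezone(timezone.utc)


# A metric datapoint stamped in UTC and a log line written in local time (UTC-4).
metric_ts = to_utc("2026-06-15T14:30:00Z")
log_ts = to_utc("2026-06-15T10:30:00", assumed_offset_hours=-4)
print(metric_ts.isoformat(), log_ts.isoformat(), metric_ts == log_ts)
```

Compared as raw strings these two look four hours apart; normalised, they are the same minute, which is the comparison that matters on the timeline.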

The golden-signal shape classifies the failure before you read a log. Latency, traffic, errors and saturation are the four golden signals, and their relative movement is a fingerprint. Errors up, latency flat, traffic flat is a fast-rejecting dependency. Latency up first, then errors, traffic flat is saturation or a slow dependency. Traffic up, then latency/errors up is load hitting a ceiling. A step change at an exact timestamp means something changed — go to CloudTrail for that minute. Reading the shape buys you the right hypothesis in the first two minutes.

Mitigation and fix are different jobs done in a fixed order. A mitigation restores service now (roll back, fail over, raise a quota, shed load, flip a flag); a fix removes the cause (correct the access pattern, repair the policy, restore ACM validation). During customer impact you mitigate first — the telemetry keeps the evidence — then diagnose and fix from a calm seat. Trying to find root cause before acting is the classic junior mistake that turns a five-minute incident into a two-hour one.

Most incidents are preventable hygiene, not fate. Three of the five scenarios below (quota, AZ, DNS/cert) are converted from outages into Tuesday-afternoon tickets by present-day alarms: a quota alarm at 80%, a per-AZ healthy-host alarm, a DaysToExpiry alarm, a CloudTrail alarm on sensitive policy events. “Convert future incidents into present-day alarms” is the whole prevention thesis.
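The quota alarm at 80% is the easiest of these to reason about. A minimal sketch of the check, assuming you already have a usage reading and the applied quota value from your own monitoring; `quota_status` and the sample numbers are illustrative, not an AWS API.

```python
def quota_status(usage: float, applied_limit: float, warn_at: float = 0.8) -> str:
    """Classify quota headroom as "ok", "alarm" (at or above warn_at) or "breached"."""
    if applied_limit <= 0:
        raise ValueError("applied_limit must be positive")
    utilisation = usage / applied_limit
    if utilisation >= 1.0:
        return "breached"
    if utilisation >= warn_at:
        return "alarm"
    return "ok"


# Illustrative readings: concurrent executions against an applied limit of 1,000.
for usage in (420, 850, 1000):
    print(usage, quota_status(usage, 1000))
```

The point of the middle state is timing: "alarm" is the Tuesday-afternoon ticket, "breached" is the incident.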

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to multi-service RCA |
|---|---|---|---|
| Incident Commander (IC) | The single coordinator; decides, assigns | Org / on-call rotation | One brain prevents three-theory thrash |
| Mitigation | Restore service now (rollback/failover/quota) | The incident bridge | Stops the bleeding before you understand it |
| Fix | Remove the underlying cause | After mitigation | The thing that stops recurrence |
| RCA | Disciplined search for the systemic cause | Post-mitigation | Finds cause, not just trigger |
| Golden signals | Latency, traffic, errors, saturation | CloudWatch | Their shape classifies the failure |
| Correlation | Lining up signals on one UTC clock | All four tools | The core multi-service move |
| Blast radius | Breadth of impact of one change | IAM/SCP/KMS/DNS | Widest for policy edits |
| Cascading failure | Small hiccup amplified by retries | Service-to-service edges | Turns a blip into an outage |
| Service Quota | An account/Region limit (rate, count) | Service Quotas console | A capacity input, not an incident, if monitored |
| AZ impairment | A fault confined to one AZ | One Availability Zone | Survivable with cross-AZ headroom |
| COE | Amazon’s blameless postmortem format | After the incident | Closes the loop with CAPA items |
| CAPA | Owned, dated corrective/preventive actions | Tracker | An action without an owner is a wish |

The incident-response lifecycle

Under pressure, ad-hoc investigation produces tunnel vision: three engineers staring at the same dashboard while nobody talks to the customer or writes anything down. A lifecycle is the antidote. It is not bureaucracy — it is the set of jobs that must happen concurrently, and naming them lets you assign them.

| Phase | Goal | Key actions | Done when |
|---|---|---|---|
| Detect | Know there is an incident, fast | Alarms (CloudWatch composite + SLO burn-rate), synthetic canaries, customer reports, Health Dashboard events | An incident is declared with a severity and a single owner |
| Triage | Size the blast radius and assign roles | Confirm scope (one customer / one AZ / one Region / global), set severity, name an Incident Commander (IC), a comms lead and an ops lead | Everyone knows their role and the severity is agreed |
| Communicate | Keep stakeholders ahead of the rumour | Status page / internal channel updates on a cadence (e.g. every 30 min for Sev-1), business-impact statement | Stakeholders are updated and the cadence is running |
| Mitigate | Stop customer pain before you understand it fully | Roll back, fail over, raise a quota, shed load, disable a feature flag: anything that restores service | Customer impact is gone or sharply reduced |
| RCA | Find the true cause, not the trigger | Correlate logs/metrics/traces/CloudTrail on one timeline, five-whys | The causal chain is written down and agreed |
| Prevent | Make recurrence impossible or detectable | CAPA action items with owners and dates, new alarms, guardrails, runbook updates | Action items are tracked to completion |

Two principles separate seniors from juniors here. First: mitigate before you fully diagnose. The instinct to find the cause before acting is wrong during customer impact — if a rollback or failover restores service, do it, then investigate from a calm seat. The cause is still in the telemetry. Second: one Incident Commander. The IC does not debug; they coordinate, decide, and protect the responders from interruptions. The fastest incidents I have run had an IC who never touched a keyboard.

The three incident roles, and the trap when one person tries to wear two of them:

| Role | Owns | Explicitly does NOT | Failure mode if merged |
|---|---|---|---|
| Incident Commander | Decisions, role assignment, severity, the bridge | Debug, type commands | If the IC debugs, nobody coordinates → tunnel vision |
| Comms lead | Status page, exec/stakeholder updates on cadence | Diagnose, decide mitigations | If merged with IC, comms slip while debugging |
| Ops lead (responder) | Run the diagnosis and mitigations | Decide scope/severity, talk to execs | If responders self-coordinate, three theories, no progress |

A note on severity: pick a simple scale and stick to it. Severity drives the comms cadence and who gets paged — it is a routing decision, not a judgement of blame.

| Severity | Definition | Page / staffing | Comms cadence | Example |
|---|---|---|---|---|
| Sev-1 | Major customer-facing outage | All hands, IC + exec comms | Every 30 min | Checkout down org-wide |
| Sev-2 | Significant degradation or single-AZ/feature impact | On-call + IC | Every 60 min | One AZ impaired, N+1 absorbs it |
| Sev-3 | Minor or internal-only impact | On-call engineer | At resolution | A non-critical dashboard stale |
| Sev-4 | Cosmetic / no customer impact | Backlog ticket | None | A typo in an internal runbook |

The mitigate-then-fix sequencing is the move juniors get wrong most. Mitigations are fast and reversible and stop pain; fixes are slower and remove the cause. The same incident usually has both, and you do them in that order:

| Incident class | Mitigation (now, reversible) | Fix (later, removes cause) | Why the order matters |
|---|---|---|---|
| Throttle cascade (S1) | Flip the flag off; table → on-demand | Restore Query/GSI; add circuit breaker | Flag-off stops bleeding in seconds; code fix takes a sprint |
| AZ impairment (S2) | Fail traffic away from the AZ | Add cross-AZ headroom; test failover | You can’t “fix” an AZ (AWS does); you fix your tolerance |
| Quota breach (S3) | Request increase; shed/queue load | Right-size reservations; alarm at 80% | Increase is minutes; capacity planning is ongoing |
| IAM/SCP (S4) | Revert the policy edit | Re-apply the intent narrowly via pipeline | Revert restores access now; correct scoping needs review |
| DNS / cert (S5) | Revert record; deploy a valid cert | Restore ACM auto-renew; records-as-code | Cert/record swap is immediate; renewal hygiene is durable |
| Deploy regression | Roll back to the last good version | Fix forward and re-deploy through CI | Rollback is the fastest path to green |

The correlation toolkit: reading four tools against one clock

The defining move of multi-service RCA is correlation. The cause lives in a different service from the symptom, so you must line up signals from several tools on a single UTC timeline. Configure every console and query to UTC for the duration of the incident, so that a log line and a metric datapoint from the same moment actually land on the same minute.

Here is what each tool is for, and crucially what it is not for:

| Tool | Answers | Best for | Blind spot |
|---|---|---|---|
| CloudWatch metrics | “What changed, and when?” | Spotting the inflection point: latency, errors, throttles, saturation | Aggregates hide per-request detail; high cardinality is expensive |
| CloudWatch Logs Insights | “What exactly happened in this service?” | Querying structured logs across log groups for patterns, error messages, request IDs | Only as good as your logging; cross-service joins are manual |
| CloudTrail | “Who did what to the control plane, and when?” | Tying an incident to a config change, deploy, IAM/SCP edit, quota change | Management events only by default; data events cost extra; ~minutes of delivery latency |
| X-Ray (traces) | “Where in the call graph is the time/error going?” | Following one request across services; the service map shows the failing edge | Sampled by default, so the one trace you want may not be captured |
| AWS Health Dashboard | “Is this AWS’s problem?” | Ruling in/out an AWS-side event in your account/Region (PHD shows account-specific) | Not real-time to the second; absence of an event is not proof |
| Trusted Advisor | “Am I near a limit or misconfigured?” | Pre-incident hygiene and quota headroom checks | Periodic, not a live diagnostic |

Before the shapes, the concrete metrics you actually overlay — which service, which metric, which golden signal it represents, and the number that means trouble:

| Service | Metric | Golden signal | “Bad” reading | Pairs with |
|---|---|---|---|---|
| ALB | TargetResponseTime | Latency | p99 climbing toward the LB timeout | HTTPCode_Target_5XX_Count |
| ALB | HTTPCode_ELB_5XX_Count | Errors | Non-zero (LB itself, not target) | RequestCount, target health |
| ALB | HealthyHostCount (by AZ) | Saturation | Drops in one AZ | per-AZ 5XX concentration |
| Lambda | Throttles | Errors | Rising while ConcurrentExecutions flat | Invocations |
| Lambda | ConcurrentExecutions | Saturation | Flatlines at 1,000 / reserved | Throttles |
| DynamoDB | ReadThrottleEvents / WriteThrottleEvents | Errors | Any non-zero | ConsumedReadCapacityUnits |
| DynamoDB | SuccessfulRequestLatency | Latency | Spiking on one operation | throttle events |
| RDS | DatabaseConnections | Saturation | Near max_connections | CPUUtilization, ReadLatency |
| ECS / EKS | CPUUtilization / pool depth | Saturation | Pool exhausted, CPU pinned | upstream latency |
| Service Quotas | AWS/Usage (per quota) | Saturation | > 80% of applied value | the throttle metric it gates |
| API Gateway | 5XXError / Latency | Errors / Latency | Spiking on one stage/route | integration (Lambda) errors |
| SQS | ApproximateAgeOfOldestMessage | Saturation | Climbing (consumer not draining) | consumer Throttles/errors |

The high-leverage skill is the golden-signal sweep: in the first two minutes, look at latency, traffic, errors and saturation (the four golden signals) for the affected service and its immediate dependencies, all on the same time window. The shape tells you the class of problem before you read a single log line. Here is the full shape-to-class map you scan first:

| Latency | Traffic | Errors | Saturation | It’s probably… | Go straight to |
|---|---|---|---|---|---|
| Flat | Flat | Up | Flat | A dependency rejecting fast (throttle/4xx/bad deploy) | The dependency’s throttle metrics |
| Up first | Flat | Up after | Rising | Saturation or a slow dependency; queue fills, then timeouts | X-Ray service map, then pool/queue depth |
| Up | Up | Up | Rising | Load-driven: a capacity or quota ceiling | Service Quotas + concurrency/throttle metrics |
| Step | Flat | Step | Flat | Something changed at an exact minute | CloudTrail lookup-events for that minute |
| Flat | Flat | Up (one third) | Flat in 2 AZs | Single-AZ impairment | ALB metrics BY AZ-ID + PHD |
| Flat | Flat | Flat (your view) | Flat | Failure is upstream (DNS/cert/edge) | dig +trace, openssl s_client |
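The shape-to-class map above is a lookup, so it can be sketched as one. This is a minimal Python illustration, not a detector: the movement labels (`"up_first"`, `"up_one_third"` and so on) are made-up encodings of the table's cells, saturation is left out to keep it small, and a human still has to read the graphs to produce the labels.

```python
from typing import Optional, Tuple

# Keys are (latency, traffic, errors) movements taken from the table above.
# Values are (probable class, first place to look).
SHAPE_MAP = {
    ("flat", "flat", "up"): (
        "dependency rejecting fast", "the dependency's throttle metrics"),
    ("up_first", "flat", "up_after"): (
        "saturation or a slow dependency", "X-Ray service map, then pool/queue depth"),
    ("up", "up", "up"): (
        "load hitting a capacity or quota ceiling",
        "Service Quotas + concurrency/throttle metrics"),
    ("step", "flat", "step"): (
        "something changed at an exact minute", "CloudTrail lookup-events for that minute"),
    ("flat", "flat", "up_one_third"): (
        "single-AZ impairment", "ALB metrics by AZ + Health Dashboard"),
    ("flat", "flat", "flat"): (
        "failure upstream of you (DNS/cert/edge)", "dig +trace, openssl s_client"),
}


def classify(latency: str, traffic: str, errors: str) -> Optional[Tuple[str, str]]:
    """Return (probable class, first tool) for a known shape, else None."""
    return SHAPE_MAP.get((latency, traffic, errors))


print(classify("up_first", "flat", "up_after"))
print(classify("step", "flat", "step"))
```

An unrecognised combination returns `None` on purpose: a shape that is not in the table is a prompt to widen the time window, not to guess.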

A reading note that saves the most time: knowing who emitted an error decides which logs you open.

| Question | The trap | How to tell |
|---|---|---|
| Did my service return the 5xx, or a layer in front? | An hour in the wrong logs | If X-Ray/logs show the request succeeding (slowly) but the client got 502, the ALB/CloudFront emitted it on timeout |
| Is the throttle in my app or in AWS? | “Scale up” masks it | An AWS throttle names the service in the error (ThrottlingException, ProvisionedThroughputExceeded); an app limit does not |
| Did a human change something, or did infra fail? | Chasing a config ghost | CloudTrail quiet at the inflection minute is a strong signal the cause is infrastructure (AZ), not a change |

Logs Insights and CloudTrail: the two queries you will type most

When errors spike, this Logs Insights query finds the dominant failure mode fast — bin by minute, count by status, and you see both the onset and the breakdown:

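# Assumes structured (JSON) logs that expose `status` and `errorType` fields; adjust to your schema
fields @timestamp, @message
| filter status >= 500 or @message like /Throttl|Timeout|ProvisionedThroughputExceeded/
| stats count(*) as errors by bin(1m), errorType
| sort errors desc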

And when a metric shows a step change at a specific minute, this CloudTrail lookup answers “what changed”:

# Who/what touched the control plane around the inflection point (UTC)
aws cloudtrail lookup-events \
  --start-time 2026-06-15T14:25:00Z \
  --end-time   2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName,Src:EventSource}' \
  --output table

Pair a step change in a metric with a CloudTrail event in the same minute and you have usually found your trigger. For request-level correlation, propagate a request ID (and the X-Ray trace ID) through your logs so you can pivot from a metric, to the traces on that edge, to the exact log lines for that request. The exact CLI / console moves for each correlation step:

| Step | Tool | Exact command / path | What it proves |
|---|---|---|---|
| Find the inflection minute | CloudWatch | Metrics → set timezone UTC → note the step | When it started |
| See the failing edge | X-Ray | Service map → click the red edge → trace list | Where in the call graph |
| Break down the errors | Logs Insights | the stats count(*) by bin(1m), errorType query | The dominant failure mode |
| Tie to a change | CloudTrail | aws cloudtrail lookup-events --start-time … --end-time … | Who/what changed at that minute |
| Confirm AWS-side | Health Dashboard | PHD → events for your account/Region | Whether it’s AWS, not you |
| Pivot to one request | Logs + X-Ray | filter logs by reqId / trace_id | The exact request’s path and lines |
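The first and fourth steps, finding the inflection minute and tying it to a change, can be sketched offline once the data is exported. A minimal Python illustration under stated assumptions: the per-minute error counts and the event list are invented sample data, `find_step` and `events_near` are hypothetical helpers, and the jump threshold and five-minute window are arbitrary choices you would tune.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def find_step(points, min_jump):
    """Return the timestamp of the largest minute-to-minute rise >= min_jump, else None."""
    best = None
    for (_, prev), (ts, cur) in zip(points, points[1:]):
        jump = cur - prev
        if jump >= min_jump and (best is None or jump > best[1]):
            best = (ts, jump)
    return best[0] if best else None


def events_near(events, when, window=timedelta(minutes=5)):
    """Keep control-plane events within `window` either side of `when`."""
    return [e for e in events if abs(e["time"] - when) <= window]


# Invented sample data: 5xx count per minute and two exported CloudTrail events.
errors_per_minute = [
    (datetime(2026, 6, 15, 14, 27, tzinfo=UTC), 2),
    (datetime(2026, 6, 15, 14, 28, tzinfo=UTC), 3),
    (datetime(2026, 6, 15, 14, 29, tzinfo=UTC), 2),
    (datetime(2026, 6, 15, 14, 30, tzinfo=UTC), 41),
    (datetime(2026, 6, 15, 14, 31, tzinfo=UTC), 44),
]
trail_events = [
    {"time": datetime(2026, 6, 15, 13, 10, 0, tzinfo=UTC), "name": "DescribeInstances"},
    {"time": datetime(2026, 6, 15, 14, 29, 40, tzinfo=UTC), "name": "UpdateFunctionConfiguration"},
]

step_at = find_step(errors_per_minute, min_jump=20)
suspects = events_near(trail_events, step_at) if step_at else []
print(step_at, [e["name"] for e in suspects])
```

Every timestamp here is timezone-aware UTC, which is what makes the subtraction meaningful. An empty suspect list is itself evidence: a step with no change beside it points at infrastructure.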

The error & throttle reference

Before the worked scenarios, here is the lookup table you scan when an error string appears: the codes and exceptions you realistically see across services during a cross-service incident, what each means, the likely cause, how to confirm it, and the first fix. The non-obvious ones are the throttle exceptions (they name the service and scale with load) and the difference between a 502 the ALB emitted and a 5xx your app emitted.

| Code / exception | Where it appears | Likely cause | How to confirm | First fix |
|---|---|---|---|---|
| ProvisionedThroughputExceededException | DynamoDB SDK / logs | Read/write past table or partition capacity | ReadThrottleEvents/WriteThrottleEvents non-zero; hot partition | On-demand or raise capacity; fix the access pattern |
| ThrottlingException / Rate exceeded | Many AWS APIs | Account/API request-rate quota hit under load | Service Quotas usage metric vs applied value | Backoff+jitter; request quota increase |
| TooManyRequestsException (Lambda 429) | Lambda invoke | Concurrency limit (account 1,000 or reserved) | Throttles up while ConcurrentExecutions flat | Raise reserved concurrency; queue with SQS |
| 502 Bad Gateway | ALB / CloudFront | Target gave no/bad answer, or upstream timeout | App trace shows request succeeding slowly → LB emitted it | Speed up target; raise idle/keep-alive; fix health |
| 503 Service Unavailable | ALB | No healthy target in the target group | Target group HealthyHostCount = 0 (maybe one AZ) | Restore healthy targets; check per-AZ |
| 504 Gateway Timeout | ALB / API Gateway | Backend slower than the LB/integration timeout | Backend p99 climbing toward the timeout value | Speed up backend; raise timeout to match |
| AccessDenied | S3/KMS/STS/most APIs | IAM/SCP/resource-policy/KMS-key-policy deny | CloudTrail errorCode: AccessDenied; trace the eval chain | Revert the offending policy edit |
| AccessDeniedException (KMS) | KMS-backed ops | Key policy or grant no longer allows Decrypt/GenerateDataKey | CloudTrail PutKeyPolicy/RevokeGrant | Restore the key-policy statement/grant |
| SERVFAIL / empty answer | dig output | DNS broken: bad record, delegation, health-check flip | dig +trace, dig @8.8.8.8 | Revert ChangeResourceRecordSets; fix NS |
| ERR_CERT_DATE_INVALID | Browser / client | Expired or wrong cert served at the edge | openssl s_client notAfter/subject | Deploy valid cert; restore ACM validation |
| 5xx from your app | App logs / X-Ray | A real runtime exception in a handler | Logs Insights filter status >= 500; stack trace | Fix the throwing code path |
| ResourceLimitExceeded / LimitExceeded | EC2/ENI/EIP APIs | A resource-count quota (ENIs, EIPs, instances) | Service Quotas applied value; Trusted Advisor limits | Request increase; right-size; clean up leaks |
| RequestLimitExceeded | EC2 control plane | API call-rate throttling (describe storms) | CloudTrail volume; SDK retry logs | Backoff+jitter; cache describes; spread calls |
| InternalError / 5xx | Any AWS API | Transient AWS-side error | PHD; retry succeeds | Retry with backoff; it’s usually self-healing |
| ConditionalCheckFailedException | DynamoDB | Optimistic-lock condition not met (often expected) | Is it in normal write paths? | Usually benign; only chase if it spikes |

Three reading notes that save the most time:

| Distinction | The trap | How to tell them apart |
|---|---|---|
| LB-emitted 502 vs app-emitted 5xx | Hours wasted in the wrong logs | If X-Ray shows the request succeeding (slowly) but the client got 502, the load balancer emitted it on timeout |
| Throttle vs application bug | “Scale up” masks the real ceiling | A throttle names the service and tracks traffic (vanishes at rest); a bug does not scale with load |
| AccessDenied from IAM vs SCP/KMS | Editing the wrong policy layer | Many services denied at one minute points above the app: SCP (mgmt account) or KMS key policy, not per-service IAM |
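The "tracks traffic, vanishes at rest" test for throttle versus bug can be made concrete with a correlation coefficient. A rough Python sketch under stated assumptions: the two series are invented per-interval counts, `looks_like_ceiling` is a hypothetical helper, and the 0.8 cut-off is an arbitrary threshold rather than a rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def looks_like_ceiling(traffic, errors, threshold=0.8):
    """True when errors rise and fall with traffic, the signature of a limit."""
    return pearson(traffic, errors) >= threshold


# Invented samples: requests per interval, and two different error series.
traffic = [100, 400, 900, 1200, 300, 100]
throttle_errors = [0, 0, 40, 120, 0, 0]   # appear only under load
bug_errors = [20, 22, 19, 21, 20, 22]     # steady regardless of load

print(looks_like_ceiling(traffic, throttle_errors))
print(looks_like_ceiling(traffic, bug_errors))
```

A high correlation does not prove a quota is the cause, but it tells you to open Service Quotas before you open the application's stack traces.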

The five worked scenarios that follow are the five complex-incident classes you will actually meet. Before the detail, here they are side by side on the dimensions that distinguish them — so mid-incident you can match the fingerprint to the scenario and jump to the right playbook:

| Scenario | Distinguishing fingerprint | Confirming tool | CloudTrail at the minute | Mitigation | Durable fix |
|---|---|---|---|---|---|
| S1: Throttle cascade | Latency up first, then 5xx, traffic flat | X-Ray red edge + ReadThrottleEvents | A deploy/flag on a third service | Kill the flag; on-demand; backoff | Restore Query/GSI; circuit breaker; bulkheads |
| S2: AZ impairment | ~1/3 fail, 2/3 fine, persistent | ALB HealthyHostCount by AZ + PHD | Quiet (infra, not a change) | Fail away from the AZ | Cross-AZ N+1; tested Multi-AZ failover |
| S3: Quota / limit | Errors track traffic, vanish at rest | Service Quotas + Throttles | Maybe a lowered reservation | Raise quota; queue; shed load | Right-size concurrency; alarm at 80% |
| S4: IAM/SCP blast radius | Many services AccessDenied, one minute | CloudTrail in mgmt account | The policy edit, exactly | Revert the policy change | Policy-as-code with review pipeline |
| S5: DNS / cert | Dashboards green, users can’t load | dig +trace / openssl s_client | ChangeResourceRecordSets or none | Revert record / deploy valid cert | ACM auto-renew; record-as-code; canary |

Worked scenario 1 — Cascading failure from a throttled dependency

Symptom. The checkout API’s p99 latency climbs from 200 ms to 8 s over ten minutes, then it starts returning 5xx. Traffic is flat — no marketing spike. The on-call for checkout swears they deployed nothing.

Hypotheses.

  1. Checkout itself regressed (deploy, memory leak, GC). Unlikely — no deploy, and latency rose before errors.
  2. A downstream dependency slowed or started rejecting, and checkout’s threads/connections are now blocked waiting on it (back-pressure cascade).
  3. A shared resource (a database, a connection pool, a Lambda concurrency limit) is saturated and several services are contending for it.

Cross-service diagnosis. The golden-signal shape (latency up first, then errors, traffic flat) points away from load and towards a slow or rejecting dependency. Open the X-Ray service map for the checkout request: it shows checkout → order-service → a DynamoDB table, and the order-service → DynamoDB edge is red with high latency and a fault rate. Pivot to CloudWatch and overlay the table’s ReadThrottleEvents (or ThrottledRequests) metric: it went non-zero exactly when checkout’s latency began to climb. Now the question is: why is the table throttling on flat traffic? CloudTrail lookup-events for the preceding hour shows a deploy on a different service, and a Logs Insights query on order-service shows it is doing a Scan where it used to Query because a feature flag flipped. A Scan reads every item in the table instead of one key’s items, so it blows through the table’s provisioned (or hot-partition) capacity; DynamoDB throttles, order-service retries (amplifying load), its connection pool fills, and checkout, waiting on order-service, backs up and finally times out. Classic cascade: the symptom is in checkout, the trigger is a flag on a third service, the bottleneck is one DynamoDB partition.

The cascade as a hop-by-hop chain, so you can see exactly where it amplified:

| Hop | What happens | Golden signal here | The amplifier |
|---|---|---|---|
| Flag flips on svc-C | Query → Scan on order-service | (none yet) | Reads the whole table |
| DynamoDB partition | Capacity blown → throttle | ReadThrottleEvents up | Hot-partition limit |
| order-service | Retries the throttled call | Latency up, then errors | Retries with no backoff multiply load |
| Connection pool | Fills with blocked waiters | Saturation up | No bulkhead → shared pool exhausts |
| checkout | Threads blocked on order-service | p99 up, then 5xx | No timeout / circuit breaker |

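Fix. Mitigate first: flip the feature flag off and, if the table is still throttling, switch it to on-demand capacity. Then remove the cause: restore the Query (or add a GSI that serves the new read pattern), and add backoff with jitter, timeouts, a circuit breaker and bulkheads so the next dependency hiccup cannot cascade.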

The resilience primitives that contain a cascade, and what each one stops:

| Primitive | What it does | Stops which failure | Cost / trade-off |
|---|---|---|---|
| Timeout | Caps how long a call waits | Threads pinned on a slow dependency | Too tight → false failures |
| Exponential backoff + jitter | Spreads and slows retries | Retry storms amplifying a throttle | Slightly higher tail latency |
| Circuit breaker | Fails fast when a dependency is sick | Cascade through the caller | Needs tuning of open/half-open thresholds |
| Bulkhead | Isolates dependencies into separate pools | One downstream exhausting a shared pool | More pools to size and monitor |
| Caching / DAX | Cuts read volume to the hot path | Hot-partition throttle | Staleness; cache invalidation |
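To make the second row concrete, here is a minimal Python sketch of exponential backoff with "full jitter": each retry waits a random time between zero and an exponentially growing cap. `ThrottledError`, `backoff_delays` and `call_with_backoff` are illustrative names, and the base, cap and retry count are assumptions to tune; AWS SDKs ship their own retry modes, which you would normally configure instead of hand-rolling this.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for whatever throttle exception your client raises."""


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * 2 ** n)


def call_with_backoff(fn, retries=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn, sleeping a jittered delay after each throttle; up to retries + 1 tries."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except ThrottledError:
            sleep(delay)
    return fn()  # last try: let the exception propagate to the caller


# With rng pinned to 1.0 the generator yields the upper bound of each window.
print([round(d, 2) for d in backoff_delays(5, rng=lambda: 1.0)])
```

The jitter matters as much as the exponent: without it, every throttled caller retries at the same instants and the retry storm simply arrives in synchronised waves.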

The lesson: retries and missing timeouts turn a small dependency hiccup into a system-wide outage. Backoff with jitter, timeouts, circuit breakers and bulkheads are the resilience primitives that contain a cascade. The access-pattern depth is in DynamoDB Deep Dive: Tables, Keys, Capacity, GSIs & Streams.
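A circuit breaker is the primitive that protects the caller, checkout in this scenario, rather than the dependency. The following is a deliberately small Python sketch, assuming a single-threaded caller; the class, its thresholds and the injected `clock` are illustrative, and production libraries add locking, per-error classification and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is currently marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
print(breaker.call(lambda: "order-service responded"))
```

While the circuit is open the caller gets an immediate error instead of a blocked thread, which is what stops the back-pressure from reaching the next hop upstream.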

Worked scenario 2 — Availability Zone impairment and failover

Symptom. Error rate jumps to roughly one-third of requests and p99 latency spikes, but two-thirds of requests are perfectly healthy. The pattern is partial and persistent — refreshes sometimes succeed, sometimes fail.

Hypotheses.

  1. A bad deploy on a subset of instances (canary gone wrong).
  2. One AZ is impaired — networking, a backing service, or capacity in that zone.
  3. A single unhealthy backend (one RDS replica, one cache node) serving a fraction of traffic.

Cross-service diagnosis. “One-third failing, two-thirds fine” across a three-AZ deployment is the signature of single-AZ impairment. Confirm it three ways. First, the AWS Health Dashboard (Personal Health Dashboard) — check for an account-specific event naming an AZ ID (note: AZ names like us-east-1a are randomised per account; the Health event and your metrics should be reconciled by AZ ID, e.g. use1-az2). Second, break the load balancer’s metrics down by AZ: target group HealthyHostCount has dropped in one AZ and HTTPCode_Target_5XX_Count is concentrated there. Third, check the data tier — if RDS is Multi-AZ, is the primary in the impaired zone? Is one ElastiCache shard’s primary there? CloudTrail will be quiet — this is not a change you made, which itself is a strong signal that the cause is infrastructure, not config.

The three confirmations, with the exact signal each gives:

| Confirmation | Where | Exact signal | Reads as |
| --- | --- | --- | --- |
| Health event | PHD (Personal Health Dashboard) | An open event naming an AZ ID (use1-az2) | AWS has acknowledged the zone |
| Per-AZ LB metrics | CloudWatch (ALB, dimension by AZ) | HealthyHostCount ↓ and HTTPCode_Target_5XX_Count ↑ in one AZ | Your traffic confirms the zone |
| Data-tier primary | RDS / ElastiCache console | Multi-AZ primary or shard primary sits in the impaired AZ | The stateful blast radius |
| CloudTrail | CloudTrail | Quiet at the inflection minute | Not a change you made → infra |

The CLI to break ALB health down by AZ and to trigger an RDS failover:

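# Healthy host count in one AZ for one target group; the AZ with the drop is impaired.
# The AvailabilityZone dimension takes the AZ name (us-east-1a), not the AZ ID, and
# HealthyHostCount is only published with the LoadBalancer dimension present.
# Map the Health event's AZ ID to your account's AZ name with:
#   aws ec2 describe-availability-zones --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output table
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/checkout/abc \
              Name=LoadBalancer,Value=app/checkout-alb/def \
              Name=AvailabilityZone,Value=us-east-1a \
  --start-time 2026-06-15T14:00:00Z --end-time 2026-06-15T15:00:00Z \
  --period 60 --statistics Minimum --output table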

# If the impaired AZ holds the RDS primary, fail over Multi-AZ (promotes the standby)
aws rds reboot-db-instance --db-instance-identifier orders-prod --force-failover
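Once you have per-AZ request and 5xx counts, the "one zone is sick, the others are fine" test is simple arithmetic. The sketch below is one hedged way to express it; the function name, the `ratio` threshold and the `floor` value are assumptions chosen for illustration, not AWS-defined figures.

```python
def impaired_azs(per_az, ratio=5.0, floor=0.001):
    """Return AZs whose 5xx rate is at least `ratio` times that of the other AZs combined.

    per_az maps an AZ label to (requests, errors_5xx) for the same time window.
    `floor` stops a near-zero baseline from flagging ordinary noise.
    """
    flagged = []
    for az, (requests, errors) in per_az.items():
        rest_requests = sum(r for a, (r, _) in per_az.items() if a != az)
        rest_errors = sum(e for a, (_, e) in per_az.items() if a != az)
        if requests == 0 or rest_requests == 0:
            continue  # nothing to compare against
        rate = errors / requests
        rest_rate = rest_errors / rest_requests
        if rate >= ratio * max(rest_rate, floor):
            flagged.append(az)
    return flagged
```

A failure spread evenly across all zones flags nothing, which is the cue to go back to the bad-deploy hypothesis instead.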

Fix.

What the architecture must already have for an AZ loss to be a non-event:

| Guardrail | Why | How to verify | What its absence causes |
| --- | --- | --- | --- |
| ≥3 AZs with N+1 headroom | Survive losing one zone with room to spare | ASG spread; subnet-per-AZ | Surviving AZs overload when one fails |
| Cross-zone load balancing | Spread shifted load evenly | ALB attribute enabled | One AZ’s targets overload |
| RDS Multi-AZ + tested failover | Promote standby fast | Game-day a forced failover | Stateful tier stuck in the bad AZ |
| Per-AZ alarms (not just aggregate) | The aggregate hides a one-AZ drop | Alarm on HealthyHostCount per AZ | You learn of it from customers |
| Reconcile by AZ ID, not name | AZ names are randomised per account | Map us-east-1a → use1-az2 | You chase the wrong zone |
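The N+1 headroom guardrail reduces to one fraction: if load is spread evenly over n zones and one is lost, each survivor picks up n/(n-1) times its previous share. A small sketch of that arithmetic, assuming evenly balanced load and using hypothetical function names:

```python
def max_safe_utilisation(az_count, az_lost=1):
    """Highest steady-state utilisation per AZ that still fits after losing az_lost zones."""
    if az_count <= az_lost:
        raise ValueError("need more AZs than you plan to lose")
    return (az_count - az_lost) / az_count


def post_failure_utilisation(current, az_count, az_lost=1):
    """Utilisation of each surviving AZ once the lost zones' load is redistributed evenly."""
    return current / max_safe_utilisation(az_count, az_lost)
```

With three zones the ceiling is two-thirds: a fleet running at 60% lands at 90% after a zone loss, while one running at 80% would need 120% and overloads.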

The lesson: design on the assumption that an AZ will fail, and an AZ failure becomes a non-event. The incident is only severe if your architecture cannot tolerate losing one zone. The Multi-AZ depth is in RDS & Aurora Deep Dive: Engines, Multi-AZ, Replicas & Backups, and the load-balancer mechanics in Elastic Load Balancing: ALB, NLB & GWLB Deep Dive.

Worked scenario 3 — A service-quota breach under load

Symptom. During a traffic surge, a fraction of API requests fail with errors that mention Rate exceeded, LimitExceeded, or ThrottlingException — and the failures track traffic: more load, more failures, and they vanish when load drops. Nothing is “down”; the system is hitting an invisible ceiling.

Hypotheses.

  1. An application-level rate limit or a downstream API throttle.
  2. A Lambda concurrency or burst limit (account or function-level reserved concurrency).
  3. An account Service Quota — API request rate, ENIs per Region, concurrent executions, EIPs, etc. — being exceeded as load scales.

Cross-service diagnosis. The “fails in proportion to load, recovers when load drops” shape is the fingerprint of a quota or limit, not a bug. Identify which ceiling. For Lambda, CloudWatch shows Throttles rising with invocations while ConcurrentExecutions flatlines at a round number (1,000 by default, or your reserved figure) — that is the concurrency limit. For API throttling, the error envelope names the service; check the Service Quotas console for that service’s “applied quota value” and compare against your CloudWatch usage metrics (many quotas now publish a usage metric you can alarm on). CloudTrail can show whether someone recently lowered a reserved concurrency or whether a new function ate the account pool. The tell that distinguishes this from scenario 1: here the errors are immediate throttles that scale with traffic, not latency-then-timeout from a slow dependency.

The common quota ceilings that bite under load, their default, and how each shows up:

| Quota | Typical default | Metric / signal | Symptom under load | Mitigation |
| --- | --- | --- | --- | --- |
| Lambda concurrent executions (account) | 1,000 | Throttles up, ConcurrentExecutions flat at 1,000 | 429 TooManyRequestsException | Raise account quota; reserved per function |
| Lambda reserved concurrency (function) | shared pool | Throttles on one function only | That function throttles, others fine | Raise/remove the too-low reservation |
| API request rate (per service) | service-specific | ThrottlingException, Rate exceeded | A fraction of calls 400/429 | Backoff+jitter; quota increase |
| ENIs per Region | account-specific | ResourceLimitExceeded on scale-out | New tasks/ENIs fail to launch | Request increase; reduce ENI churn |
| Elastic IPs per Region | 5 | AddressLimitExceeded | Cannot allocate an EIP | Request increase; release unused EIPs |
| DynamoDB on-demand throughput | account/table | ProvisionedThroughputExceeded (burst) | Throttles on a sudden spike | Pre-warm; switch capacity mode |
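The Lambda fingerprint (Throttles rising while ConcurrentExecutions sits flat at the ceiling) can be checked mechanically once the two metric series are exported for the same periods. This is a rough sketch under that assumption; the function name and the 2% tolerance are illustrative choices.

```python
def looks_like_concurrency_ceiling(concurrent, throttles, limit, tolerance=0.02):
    """True when every throttled period coincides with concurrency pinned near `limit`.

    concurrent and throttles are equal-length lists of per-period datapoints
    (for example, Maximum of ConcurrentExecutions and Sum of Throttles).
    """
    throttled = [c for c, t in zip(concurrent, throttles) if t > 0]
    if not throttled:
        return False  # no throttling at all, so no ceiling is being hit
    return all(c >= limit * (1 - tolerance) for c in throttled)
```

If throttles appear while concurrency is well below the account limit, look instead at a function-level reserved concurrency that has been set too low.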

Fix.

The CLI to read an applied quota and to request an increase:

# Read the applied value for Lambda concurrent executions (quota L-B99A9384)
aws service-quotas get-service-quota \
  --service-code lambda --quota-code L-B99A9384 \
  --query 'Quota.{Name:QuotaName,Applied:Value}' --output table

# Request an increase (some auto-approve; others raise a Support case)
aws service-quotas request-service-quota-increase \
  --service-code lambda --quota-code L-B99A9384 --desired-value 3000

The lesson: quotas are a capacity-planning problem, not an incident. If you discover a quota during an outage, you have a monitoring gap, not just a limit. The concurrency mechanics are in Lambda Deep Dive: Runtimes, Triggers, Layers & Concurrency.

Worked scenario 4 — An IAM/SCP change with a wide blast radius

Symptom. At a precise timestamp, multiple unrelated services across one or several accounts start failing with AccessDenied — uploads to S3, a Lambda that can no longer write to DynamoDB, an ECS task that cannot pull from ECR. No application deploy happened. The breadth is the clue: many services, one moment, same error class.

Hypotheses.

  1. A KMS key policy or grant was changed and everything that decrypts through it now fails (a wide but specific blast radius).
  2. An IAM change — a shared role’s permissions, or a permission boundary tightened.
  3. A Service Control Policy (SCP) at the Organization/OU level changed and is now denying actions org-wide regardless of IAM (SCPs set the maximum permissions; an explicit deny there wins everywhere).

Cross-service diagnosis. When AccessDenied appears across many services at the same minute, suspect a policy layer above the application, because identity-based policies are usually edited per service. Go straight to CloudTrail and look up PutBucketPolicy, PutKeyPolicy, PutRolePolicy, DeleteRolePolicy, PutRolePermissionsBoundary (or PutUserPermissionsBoundary), and crucially the Organizations events UpdatePolicy/AttachPolicy (SCPs are recorded in the management/delegated-admin account’s CloudTrail). The timestamp of the change will line up exactly with the onset. To prove which layer is denying, take one failing call and reason through the evaluation chain: an SCP deny blocks the action no matter what IAM allows, so if the IAM policy looks correct but the call still fails org-wide, the SCP (or a permission boundary, or a resource-policy/KMS-key-policy deny) is the culprit. CloudTrail’s errorCode: AccessDenied entries, combined with knowing that explicit deny always wins, let you walk from symptom to the exact policy edit.
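The evaluation chain can be sketched as a small function. This is a deliberately simplified model for reasoning during an incident, not the full IAM evaluation logic: it assumes a same-account request, ignores session policies and resource-policy grants, and the layer names and return strings are hypothetical.

```python
LAYERS = ("scp", "resource_policy", "permission_boundary", "identity_policy")


def evaluate(effects):
    """Return (decision, reason) for one request.

    effects maps a layer name to "Allow", "Deny", or None (layer in use but
    no matching Allow). Leave a key out when that layer is not in use.
    """
    # 1. An explicit Deny in any layer wins, whatever the others say.
    for layer in LAYERS:
        if effects.get(layer) == "Deny":
            return ("Deny", f"explicit deny in {layer}")
    # 2. SCPs and permission boundaries only cap: if in use, they must Allow.
    for layer in ("scp", "permission_boundary"):
        if layer in effects and effects[layer] != "Allow":
            return ("Deny", f"implicit deny: no Allow in {layer}")
    # 3. Simplification: the identity policy must grant the action.
    if effects.get("identity_policy") != "Allow":
        return ("Deny", "implicit deny: no Allow in identity_policy")
    return ("Allow", "allowed by every layer in use")
```

The useful case during scenario 4 is `evaluate({"scp": "Deny", "identity_policy": "Allow"})`: the IAM policy looks correct, yet the result is still a deny attributed to the SCP.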

The policy layers, ordered by reach, with the CloudTrail event that records an edit to each:

| Layer | Reach | Explicit deny here means… | CloudTrail event to look up | Where the trail lives |
| --- | --- | --- | --- | --- |
| SCP (Organizations) | Whole OU / org | Denied everywhere, regardless of IAM | AttachPolicy, UpdatePolicy, DetachPolicy | Management / delegated-admin account |
| Resource policy (S3 bucket, ECR repository, etc.) | That resource | Denied on that resource for all principals | PutBucketPolicy, SetRepositoryPolicy | The resource’s account |
| KMS key policy / grant | Everything using the key | Decrypt/encrypt fails for the named principals | PutKeyPolicy, RevokeGrant, CreateGrant | The key’s account |
| Permission boundary | The bounded principal | Caps that role even if its policy allows | PutRolePermissionsBoundary, DeleteRolePermissionsBoundary | The principal’s account |
| Identity policy (IAM) | One user/role | The usual per-service permission | PutRolePolicy, AttachRolePolicy, DeleteRolePolicy | The principal’s account |

The decision table — given the breadth and the trail, which layer is the culprit:

| If you see… | And CloudTrail shows… | It’s probably… | Do this |
| --- | --- | --- | --- |
| Many services, many accounts, one minute | AttachPolicy/UpdatePolicy in mgmt account | An SCP edit | Detach/correct the SCP in the management account |
| Everything that decrypts fails | PutKeyPolicy/RevokeGrant | A KMS key policy/grant change | Restore the Decrypt/GenerateDataKey statement |
| One bucket denies all writers | PutBucketPolicy | A resource policy edit | Revert the bucket policy statement |
| One role suddenly limited | PutRolePermissionsBoundary | A tightened boundary | Loosen/correct the boundary narrowly |
| One service, one account | PutRolePolicy/DeleteRolePolicy | An identity policy edit | Revert the role policy |

Fix.

The CloudTrail lookup that finds the offending edit, by event name:

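# Find the offending edit in the inflection window. lookup-events accepts one lookup
# attribute per call, so repeat for UpdatePolicy, PutKeyPolicy, PutRolePolicy and so on.
# For SCPs, run it in the management account against us-east-1, where Organizations events are logged.
aws cloudtrail lookup-events --region us-east-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AttachPolicy \
  --start-time 2026-06-15T14:25:00Z --end-time 2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table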

The lesson: a one-line policy edit can have the widest blast radius in AWS. The layered evaluation model (SCP → resource policy → permission boundary → identity policy, with explicit deny trumping all) is exactly what you walk during diagnosis — and exactly why these changes belong in a reviewed pipeline. The governance depth is in Organizations, SCP Guardrails & Delegated Admin and KMS Encryption Deep Dive: Keys, Policies, Envelope & Rotation.

Worked scenario 5 — A Route 53 / DNS or certificate failure

Symptom. Users report “the site won’t load” or “your connection is not private,” but your load balancer and application metrics look completely healthy — low latency, no 5xx, normal CPU. The failure is happening before traffic reaches your infrastructure, which is why your dashboards are clean.

Hypotheses.

  1. DNS resolution is failing or returning the wrong answer (a record was changed/deleted, a failover/health-check flipped, an NS delegation issue).
  2. A TLS certificate expired or the wrong certificate is being served (ACM cert not renewed because validation lapsed; cert/domain mismatch).
  3. CDN/edge or WAF is blocking or misrouting at the edge.

Cross-service diagnosis. Healthy backend metrics with “can’t load” reports is the signature of an edge/DNS/cert problem. Split the two possibilities at the command line. For DNS, resolve the name and inspect the answer end to end:

dig +trace example.com           # follows the delegation chain NS by NS
dig example.com A @8.8.8.8       # what a public resolver actually returns

If dig returns the wrong IP, an empty answer, or SERVFAIL, the problem is DNS. In Route 53, check whether a health check flipped a failover record (a misconfigured health check that reports failure will route traffic away from an endpoint that is actually healthy), whether a record was edited (CloudTrail ChangeResourceRecordSets shows the change and the actor), and whether the hosted zone’s NS records still match the registrar’s delegation. For TLS, inspect the served certificate’s expiry and subject:

echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

An expired notAfter, or a subject/SAN that does not cover the host, is your cause. With ACM, expiry almost always traces back to DNS validation breaking (the CNAME validation record was removed, so ACM could not auto-renew) — and CloudTrail/ACM events will show the renewal failures.
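To turn the openssl output into a number you can alarm on, a short parser is enough. The sketch below assumes the usual `notAfter=Jun 15 12:00:00 2026 GMT` line format printed by `openssl x509 -noout -dates` in an English locale; the function name is hypothetical.

```python
from datetime import datetime, timezone


def days_until_expiry(openssl_dates_output, now=None):
    """Whole days until the certificate's notAfter (negative once it has expired)."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            stamp = line.split("=", 1)[1].strip().removesuffix(" GMT")
            expires = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
            expires = expires.replace(tzinfo=timezone.utc)  # openssl reports GMT
            now = now or datetime.now(timezone.utc)
            return (expires - now).days
    raise ValueError("no notAfter line found in the openssl output")
```

Feed it the captured output of the openssl pipeline above. A result at or below zero means the expiry hypothesis is confirmed; a small positive number is a renewal that should already be under investigation.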

The edge failure modes, the one command that confirms each, and the fix:

| Failure mode | Confirm with | Tell-tale | Mitigate | Fix the cause |
| --- | --- | --- | --- | --- |
| Record edited/deleted | dig A @8.8.8.8; CloudTrail ChangeResourceRecordSets | Wrong/empty answer | Revert the record change | Manage records as reviewed code |
| Failover health-check flipped | Route 53 health-check status | Traffic routed to the secondary/none | Correct the health-check config | Tune thresholds; right endpoint |
| NS delegation mismatch | dig +trace; compare registrar NS | Delegation chain breaks | Fix registrar NS to match zone | Document delegation; alarm on drift |
| Cert expired | openssl s_client notAfter | ERR_CERT_DATE_INVALID | Deploy a valid cert now | Let ACM auto-renew; keep CNAME |
| Wrong cert / SAN mismatch | openssl s_client -subject | Host not in subject/SAN | Attach the correct cert | Issue/replace cert covering the host |
| ACM auto-renew failed | ACM console; CloudTrail renewal events | Validation CNAME missing | Re-add the validation CNAME | Keep the DNS validation record intact |

Fix.

The lesson: when your metrics are healthy but users cannot reach you, look outward — DNS, certificate, edge — because the failure is upstream of everything your CloudWatch sees. The routing and health-check depth is in Route 53: DNS Records, Routing Policies & Health Checks.

Service Quotas, Trusted Advisor and the Health Dashboard as prevention

Three of the five scenarios above (and most real incidents) are preventable with hygiene that lives outside the application:

The three prevention tools, what they catch, their cadence, and how to wire an alarm:

| Tool | Catches | Cadence | How to alarm / automate | Cost |
| --- | --- | --- | --- | --- |
| Service Quotas (usage metrics) | Approaching a limit that scales with traffic | Near-real-time metric | CloudWatch alarm at 80% of AWS/Usage | Free |
| Trusted Advisor (limit + FT checks) | Service-limit headroom, single-AZ risk, open buckets | Periodic (refresh) | EventBridge on check status; weekly review | Business/Enterprise Support for full set |
| AWS Health (PHD) | AWS-side events for your resources/AZs | Event-driven | EventBridge rule → SNS/page | Free |
| CloudWatch composite alarm | Multi-signal SLO breach (burn rate) | Real-time | Composite of golden-signal alarms | Per-alarm pricing |

The pattern across all three: convert future incidents into present-day alarms. A quota you alarm on at 80% is a Tuesday-afternoon ticket; the same quota discovered at 100% during a launch is a Sev-1.
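The 80% rule is a one-line comparison, shown here as a sketch you could run against any usage figure and applied quota value. The function name and the three status labels are illustrative assumptions.

```python
def quota_status(usage, limit, warn_at=0.80):
    """Classify quota headroom: 'ok', 'warn' (ticket it now) or 'breached' (incident)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    utilisation = usage / limit
    if utilisation >= 1.0:
        return "breached"
    if utilisation >= warn_at:
        return "warn"
    return "ok"
```

For the Lambda default of 1,000 concurrent executions, 640 in use is `ok`, 800 is `warn`, and 1,000 is `breached`.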

Closing the loop: the blameless COE

Prevention is only real once it is written down and tracked. Amazon’s Correction of Error (COE) is a blameless postmortem with a fixed shape — and the discipline is that every contributing factor produces a tracked, owned, dated CAPA action item. The sections, what each captures, and the trap if you skip it:

| COE section | What it captures | Done well | Trap if skipped |
| --- | --- | --- | --- |
| Summary | One paragraph: what broke, who was impacted | Plain, customer-framed | Reads as internal jargon nobody acts on |
| Impact | Duration, scope, requests/revenue affected | Quantified, not “some users” | Severity gets re-litigated later |
| Timeline | Detection → mitigation → resolution on one UTC clock | Minute-by-minute with actors | “It was a blur”: no learning |
| Five-whys | Trigger → … → systemic root cause | Reaches a missing alarm/timeout/review | Stops at the trigger; it recurs |
| Contributing factors | Everything that made it worse or slower | Honest, blameless | Single-cause myth hides real gaps |
| CAPA action items | Owned, dated corrective + preventive work | One alarm/guardrail per factor | A wish list nobody completes |
| Lessons learned | What the team now knows | Shared org-wide | Knowledge stays in one head |

The five-whys is where juniors stop too early. “DynamoDB throttled” is a trigger, not a root cause; keep asking until you reach the systemic gap — why was there no throttle alarm, no backoff, no load test on access-pattern changes? — because that gap is what the CAPA items must close.

Architecture at a glance

The diagram traces a real request as it actually flows and maps each of the five complex-incident classes onto the exact hop where it bites. Read it left to right. A request enters at the edge: Route 53 resolves the name and CloudFront terminates TLS with an ACM certificate — this is where DNS/cert failures (badge 1) strike, and the cruel part is that everything downstream stays green, so your dashboards lie. It passes into ingress, an ALB spread across three AZs with WAF in front — this is where a single-AZ impairment (badge 2) shows up as one-third of requests failing while the per-AZ HealthyHostCount drops in exactly one zone. It reaches compute: the checkout service calls the order service (ECS/EKS pools and Lambda concurrency), where a quota or concurrency ceiling (badge 3) throttles in proportion to load. The order service then hits the data tier — a DynamoDB table whose hot partition throttles and cascades back upstream (badge 4) when a flag flips Query to Scan, plus an RDS Multi-AZ instance whose primary may sit in the impaired zone.

Above and beneath the path runs the control & correlation plane: CloudWatch (metrics + Logs Insights), CloudTrail (who changed what), and the X-Ray service map (the fault edge), all read against one UTC clock. That plane is also where an IAM/SCP blast-radius change (badge 5) is diagnosed: a one-line policy edit denies many services at one minute, and only CloudTrail in the management account shows the edit. The whole method is in the picture: localise the symptom to a hop, read the golden-signal shape, run the named tool on one clock, and walk to the cause even when it lives a hop or a policy-layer away.

AWS multi-service incident request path with the five complex-incident classes mapped onto the hops where they bite: a left-to-right flow from the edge zone (Route 53 with A/ALIAS records and health checks, and CloudFront terminating TLS with an ACM certificate — badge 1, DNS and certificate failures that leave dashboards green), into the ingress zone (an ALB with per-AZ targets across three AZs and WAF — badge 2, single-AZ impairment showing as one-third of requests failing), into the compute zone (a checkout service on ECS or EKS with a connection pool of 200 and an order service on Lambda with concurrency 1,000 — badge 3, quota and concurrency throttles that track traffic), into the data zone (a DynamoDB orders table with a hot partition — badge 4, a throttle cascade back upstream — and an RDS Multi-AZ primary in AZ-b), and a control and correlation zone read on one UTC clock (CloudWatch metrics and Logs Insights, CloudTrail showing who changed what — badge 5, an IAM or SCP blast-radius edit denying many services at one minute — and the X-Ray service map highlighting the fault edge), with flows labelled HTTPS 443, HTTP to target, Query/UpdateItem, and trace plus throttle metrics feeding back from the correlation plane

Real-world scenario

Northwind Commerce runs an e-commerce platform on AWS in ap-south-1 (Mumbai): CloudFront → ALB (three AZs) → an ECS Fargate checkout service → a Lambda order service → a DynamoDB orders table, with RDS Multi-AZ (PostgreSQL) for the ledger and an SQS queue for async fulfilment. Traffic averages 600 requests/second with a 7pm festival-sale spike to ~2,400 rps. The platform team is five engineers across two squads; the monthly AWS spend is about ₹9,80,000.

The incident began on a Saturday festival sale. At 19:06 the SLO burn-rate alarm fired: checkout error rate crossing 5% and climbing. The first responder’s reflex was to look at the checkout service — CPU normal, no deploy, recent logs unremarkable. Meanwhile the on-call for the order squad swore they had not deployed either. Two squads, two dashboards, and for the first eight minutes no Incident Commander — three theories, no progress, exactly the failure the lifecycle exists to prevent.

The turn came when a senior engineer took the IC role, put every console in UTC, and ran the golden-signal sweep on checkout and its dependencies. The shape was unambiguous: checkout p99 rose first (200 ms → 6 s), then errors followed, with traffic only modestly up. That is the shape of a slow or rejecting dependency, not load. The X-Ray service map showed the order-service → DynamoDB edge glowing red with a fault rate. Overlaying the table’s ReadThrottleEvents in CloudWatch, the metric went non-zero at 19:04, two minutes before the alarm. Now the real question: why throttle on near-flat traffic? A CloudTrail lookup-events for the preceding hour surfaced an UpdateFunctionConfiguration on a recommendations Lambda at 18:58, and a Logs Insights query on order-service showed it had started doing a Scan instead of a Query: a shared library upgrade in that deploy had flipped a feature flag that changed the access pattern. The Scan hammered one partition, DynamoDB throttled, order-service retried without backoff (amplifying the load), its connection pool filled, and checkout, blocked waiting on order-service, backed up and finally 5xx’d. The symptom was in checkout; the trigger was a deploy on a third service; the bottleneck was one DynamoDB partition.

Mitigation, in order: the IC had the recommendations squad revert the flag (removing the Scan instantly), flipped the orders table to on-demand to absorb the residual burst, and confirmed clients now had backoff with jitter via a config push. Error rate fell below 1% within four minutes of the flag revert. They did not scale the checkout service: that would have fixed nothing and cost money, and it is the classic reflex the team had been burned by before.

The durable fix landed the following week: restore the Query access pattern against the right GSI; add a circuit breaker in checkout so a slow order-service fails fast instead of consuming the whole pool; bound retries and set a 2-second client timeout; and add bulkheads so the order dependency cannot exhaust checkout’s shared pool. On prevention they wired three alarms — ReadThrottleEvents > 0, connection-pool saturation, and an end-to-end checkout canary — and added a CI gate requiring a load test for any access-pattern change. The next festival sale ran at 2,500 rps with zero DynamoDB throttles, checkout p95 held at 180 ms, and the COE’s headline line went on the wall: “The failing service is rarely the broken one — read the shape, follow the edge, find the flag.”
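The circuit breaker in that durable fix can be sketched in a few lines. This is an illustrative, single-threaded model only: the class name, the thresholds and the `CircuitOpenError` type are assumptions, and a production service would normally use a maintained resilience library with proper locking.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

Wrapped around checkout's call to order-service, a breaker like this makes a sick dependency cost one fast exception per request instead of one pooled connection held until timeout.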

The incident as a timeline, because the order of moves is the lesson:

| Time (UTC offset) | Symptom | Action taken | Effect | What it should have been |
| --- | --- | --- | --- | --- |
| 18:58 | (none yet) | recommendations deploy flips a flag | Scan begins on order-service | Load-test access-pattern changes in CI |
| 19:04 | ReadThrottleEvents > 0 | (no alarm: a gap) | Throttle starts; latency builds | An alarm here would pre-empt the page |
| 19:06 | Checkout error > 5% | Burn-rate alarm fires; stare at checkout | 8 min lost, no IC | Name an IC immediately |
| 19:14 | Still climbing | Senior takes IC, sweeps golden signals in UTC | Shape = slow dependency | The breakthrough |
| 19:17 | Edge identified | X-Ray red edge + ReadThrottleEvents overlay | Bottleneck = one DDB partition | |
| 19:20 | Trigger found | CloudTrail shows 18:58 deploy; Logs show Scan | Cause = flipped flag | |
| 19:24 | Mitigated | Revert flag; table → on-demand; backoff push | Errors < 1% | Correct night-of mitigation |
| +1 week | Fixed | Query+GSI, circuit breaker, bulkheads, 3 alarms, canary | 0 throttles at 2,500 rps | The actual fix is code + alarms |

Advantages and disadvantages

A disciplined, correlation-first incident method has clear strengths and real costs. Weigh it honestly before you mandate it:

| Advantages (why this method pays off) | Disadvantages (why it has a cost) |
| --- | --- |
| Finds the cause even when it’s a hop or a policy-layer away; the symptom service is rarely the broken one | Requires instrumentation before the incident: structured logs with trace IDs, X-Ray on the critical path, quota alarms |
| The golden-signal shape classifies the failure in two minutes, before any log-reading | Needs practised judgement; reading shapes well is a skill, not a checklist |
| One UTC clock across four tools removes the “wrong time zone” class of false conclusions | Discipline slips under pressure; it takes drills (game days) to make it muscle memory |
| Mitigate-then-diagnose minimises customer impact; the telemetry keeps the evidence | The instinct to “understand first” is strong; juniors resist mitigating blind |
| The IC role prevents three-theory thrash and protects responders | A dedicated IC is staffing overhead small teams feel acutely |
| Blameless COE + CAPA closes the loop so incidents don’t recur | Postmortems take time and only pay off if CAPA items are tracked to done |
| Quota/AZ/cert hygiene converts whole incident classes into Tuesday tickets | Alarms and canaries cost a little to run and must be maintained (alarm fatigue is real) |

The method is right for any team running a real distributed system where minutes of customer impact are expensive and incidents span services and teams. It is overkill for a single static site or a one-service hobby project — there, a stack trace is enough. The disadvantages are all front-loaded investment: instrument, alarm, drill. Pay it before the incident, or pay double during one.

Hands-on lab — correlate a self-inflicted incident (Free Tier)

You will create a tiny, safe incident and practise the correlation workflow end to end, then clean up. Everything here is Free-Tier-eligible if you tear it down promptly.

1. Set up a function and a log group. Create a minimal Lambda (any runtime) named coe-lab that logs a structured line and occasionally “fails”:

aws lambda create-function --function-name coe-lab \
  --runtime python3.13 --handler index.handler --timeout 5 \
  --role arn:aws:iam::111122223333:role/your-lambda-basic-role \
  --zip-file fileb://function.zip

(function.zip contains an index.py whose handler prints {"level":"INFO","reqId":context.aws_request_id,...} and raises an exception when the event has {"fail": true}.)
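A minimal index.py matching that description might look like the sketch below. The field names beyond level and reqId, and the exception message, are assumptions made for the lab; compact JSON separators are used so the log lines match the `"level":"ERROR"` pattern that the Logs Insights filter in step 4 looks for.

```python
import json


def _log(level, req_id, msg):
    # One structured JSON line per event, written to stdout (CloudWatch Logs picks it up).
    print(json.dumps({"level": level, "reqId": req_id, "msg": msg},
                     separators=(",", ":")))


def handler(event, context):
    req_id = getattr(context, "aws_request_id", "unknown")
    _log("INFO", req_id, "invoked")
    if event.get("fail"):
        _log("ERROR", req_id, "simulated failure")
        raise RuntimeError("coe-lab simulated failure")
    return {"ok": True, "reqId": req_id}
```

Package it with `zip function.zip index.py` before running the create-function command above.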

2. Generate a signal. Invoke it a few times, including failures, to put both success and error lines into CloudWatch Logs:

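# AWS CLI v2 treats --payload as base64 unless --cli-binary-format raw-in-base64-out is set
for i in $(seq 1 20); do
  aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out --payload '{"fail":false}' /dev/null >/dev/null
done
aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out --payload '{"fail":true}' /dev/null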

3. Triage with metrics. In the CloudWatch console, open the Errors, Invocations and Throttles metrics for coe-lab (set the timezone to UTC). Note the minute the error appears — that is your inflection point.

4. Drill in with Logs Insights. Run a query against the function’s log group:

fields @timestamp, @message
| filter @message like /ERROR|Traceback|"level":"ERROR"/
| stats count(*) as errors by bin(1m)
| sort @timestamp desc

Confirm the error count and timing match the metric. This is the metric→logs pivot you will do in every real incident.

5. Tie it to a change with CloudTrail. Update the function’s configuration (a harmless change) to create a control-plane event:

aws lambda update-function-configuration --function-name coe-lab --timeout 6
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=coe-lab \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table

You should see your UpdateFunctionConfiguration event with its timestamp and actor — exactly how you would prove “someone changed this at 14:31”.

6. Read a quota’s applied value. Practise the scenario-3 move — see how much headroom you have on Lambda concurrency:

aws service-quotas get-service-quota \
  --service-code lambda --quota-code L-B99A9384 \
  --query 'Quota.{Name:QuotaName,Applied:Value}' --output table

Validation. You have, on one UTC timeline, (a) a metric inflection, (b) the matching log lines, (c) the control-plane change that explains a config-driven incident, and (d) a read of the quota that scenario 3 would breach. That is the core RCA loop in miniature.

Each lab step mapped to the real incident move it rehearses:

| Lab step | Real-incident move | Tool |
| --- | --- | --- |
| 3: note the inflection minute (UTC) | Golden-signal sweep | CloudWatch metrics |
| 4: Logs Insights error count by minute | Metric → logs pivot | Logs Insights |
| 5: CloudTrail lookup-events | Tie a step change to a change | CloudTrail |
| 6: read the applied quota | Quota headroom check (scenario 3) | Service Quotas |

Cleanup.

aws lambda delete-function --function-name coe-lab
aws logs delete-log-group --log-group-name /aws/lambda/coe-lab

Cost note. A handful of Lambda invocations and a few Logs Insights queries fall within the Free Tier; the log group stores a trivial amount. CloudTrail management events are free to view via lookup-events. Deleting the function and log group leaves nothing billable. If you ever enable CloudTrail data events or create a CloudWatch alarm at scale, those can incur small charges — not needed for this lab.

Common mistakes & troubleshooting

This is the differentiator. Cross-service RCA fails for process reasons as often as technical ones — and the technical traps recur. The playbook below is the table to keep open at 02:14: match the symptom of your investigation or the incident, read the cause, run the exact confirm, apply the fix.

| # | Symptom (of your process or the incident) | Root cause | Confirm (exact command / path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Investigation goes in circles; the “fix” doesn’t hold | Comparing signals in different time zones | Check each console/query header: is it local or UTC? | Put every console and query in UTC; one clock, one window |
| 2 | You debug 40 min while customers suffer | Trying to find root cause before mitigating | Is customer impact still live and unmitigated? | Mitigate first (roll back / fail over / raise quota), diagnose after |
| 3 | Three people, three theories, no progress | No Incident Commander | Is anyone coordinating vs everyone debugging? | Name an IC who decides and coordinates and does not debug |
| 4 | You “fixed” the trigger but it recurs | Treated the trigger as the cause | Did the five-whys reach a systemic gap (no alarm/timeout)? | Five-whys to the systemic cause; add the missing guardrail |
| 5 | The one trace you need isn’t in X-Ray | Default sampling dropped it | X-Ray sampling rules show the default reservoir | Raise sampling for the affected service; lean on metrics+logs meanwhile |
| 6 | CloudTrail “shows nothing” around the event | Wrong account, or only management events | Are you in the mgmt/delegated-admin account? Window wide enough? | Check the org trail; widen for delivery latency; enable data events if needed |
| 7 | You blame the failing service | Symptom and cause are in different services | Does the X-Ray map show the fault on a downstream edge? | Follow the service map edge; read golden-signal shapes |
| 8 | You scaled up and it “worked,” then returned worse | Masked a resource ceiling (quota/throttle/saturation) | Do errors track traffic (vanish at rest)? | Find the ceiling (Service Quotas / throttle metric); fix, don’t mask |
| 9 | 502s blamed on the app, but app logs are clean | The load balancer emitted the 502 on timeout | X-Ray shows the request succeeding slowly; client got 502 | Speed up the target; align LB/integration timeout |
| 10 | “One-third fail, two-thirds fine” chased as a bad host | Single-AZ impairment | ALB HealthyHostCount by AZ-ID; PHD event | Fail away from the AZ; rely on cross-AZ N+1 headroom |
| 11 | Many services AccessDenied, you edit each IAM policy | The deny is a layer above IAM (SCP/KMS) | CloudTrail AttachPolicy/PutKeyPolicy in mgmt account | Revert the SCP/key-policy edit; explicit deny wins |
| 12 | Dashboards green but users can’t load; you check the app | Failure is upstream (DNS/cert/edge) | dig +trace; openssl s_client … -dates | Restore the record / ACM validation CNAME / valid cert |
| 13 | Postmortem names a person | Blameful culture | Does the COE ask “who” or “what about the system”? | Run a blameless COE; fix the system that let it happen |
| 14 | Quota discovered at 100% during a launch | No quota monitoring | Is there an alarm at 80% of this quota? | Alarm on the Service Quotas usage metric at ~80% |

Best practices

Security notes

The security-relevant control-plane events to alarm on, and why each matters during an incident:

| Event | Service | Why alarm on it | During an incident it tells you |
| --- | --- | --- | --- |
| AttachPolicy / UpdatePolicy | Organizations (SCP) | Widest blast radius in AWS | An org-wide deny was just introduced |
| PutKeyPolicy / RevokeGrant | KMS | Breaks everything that decrypts | Why many services lost Decrypt |
| PutRolePolicy / DeleteRolePolicy | IAM | Per-service permission change | A specific role gained/lost access |
| ChangeResourceRecordSets | Route 53 | DNS hijack / outage vector | Who edited the record and to what |
| AuthorizeSecurityGroupIngress | EC2 | An incident-time “just to test” hole | A security group was opened up |
| StopLogging / DeleteTrail | CloudTrail | An attacker covering tracks | Your evidence source was tampered with |
| PutBucketPolicy / PutBucketAcl | S3 | Can expose a bucket publicly | A data-exposure change just landed |
| ConsoleLogin (failure / new region) | IAM (CloudTrail) | Credential misuse signal | Who logged in, from where, success or not |
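During an incident the same watchlist can be applied to CloudTrail records you have already pulled. The sketch below is one way to do that, assuming each record is a dict with the standard `eventSource`, `eventName` and `eventTime` fields. The sample records are invented for illustration, and the event-source strings should be verified against your own trail before you rely on them.

```python
# Watchlist keyed by (eventSource, eventName), mirroring the table above.
WATCHLIST = {
    ("organizations.amazonaws.com", "AttachPolicy"): "An org-wide deny may have been introduced",
    ("organizations.amazonaws.com", "UpdatePolicy"): "An org-wide deny may have been introduced",
    ("kms.amazonaws.com", "PutKeyPolicy"): "Services may have lost Decrypt",
    ("kms.amazonaws.com", "RevokeGrant"): "Services may have lost Decrypt",
    ("iam.amazonaws.com", "PutRolePolicy"): "A role gained or lost access",
    ("iam.amazonaws.com", "DeleteRolePolicy"): "A role gained or lost access",
    ("route53.amazonaws.com", "ChangeResourceRecordSets"): "A DNS record was edited",
    ("ec2.amazonaws.com", "AuthorizeSecurityGroupIngress"): "A security group was opened up",
    ("cloudtrail.amazonaws.com", "StopLogging"): "The evidence source was tampered with",
    ("cloudtrail.amazonaws.com", "DeleteTrail"): "The evidence source was tampered with",
    ("s3.amazonaws.com", "PutBucketPolicy"): "A data-exposure change may have landed",
    ("s3.amazonaws.com", "PutBucketAcl"): "A data-exposure change may have landed",
    ("signin.amazonaws.com", "ConsoleLogin"): "Someone signed in to the console",
}


def flag_events(records):
    """Return (eventTime, eventName, meaning) for watchlisted records, oldest first."""
    hits = []
    for rec in records:
        key = (rec.get("eventSource"), rec.get("eventName"))
        if key in WATCHLIST:
            hits.append((rec.get("eventTime", ""), rec.get("eventName"), WATCHLIST[key]))
    # ISO 8601 UTC strings in one format sort correctly as plain text.
    return sorted(hits)


# Hypothetical records for illustration only.
sample = [
    {"eventTime": "2024-05-01T02:13:40Z", "eventSource": "kms.amazonaws.com", "eventName": "PutKeyPolicy"},
    {"eventTime": "2024-05-01T02:10:02Z", "eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"eventTime": "2024-05-01T02:12:11Z", "eventSource": "organizations.amazonaws.com", "eventName": "AttachPolicy"},
]
for when, name, meaning in flag_events(sample):
    print(when, name, "->", meaning)
```

Keying on the source and the name together matters because several services share event names such as `AttachPolicy`.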

Cost & sizing

The incident method itself is nearly free — the cost is in the instrumentation that makes correlation possible, and the trade-off is “spend a little continuously to avoid spending a lot during an outage.” What drives the observability bill, and how to right-size it:

| Cost driver | What it is | Rough figure | Right-size by |
| --- | --- | --- | --- |
| CloudWatch custom metrics | Per-metric monthly charge | ~$0.30/metric/mo (first tier) | Emit only the golden signals + key dependencies |
| CloudWatch Logs ingestion | Per-GB ingested | ~$0.50–0.57/GB (Region-dependent) | Sample/structure logs; drop debug in prod |
| Logs Insights queries | Per-GB scanned | ~$0.005/GB scanned | Narrow the time window; filter early |
| X-Ray traces | Per-trace recorded/retrieved | ~$5 per 1M traces recorded | Sample the critical path; raise only during incidents |
| CloudTrail management events | First copy of mgmt events | Free | Always on; it’s your evidence |
| CloudTrail data events | S3/DynamoDB object-level | ~$0.10 per 100K events | Enable only on sensitive buckets/tables |
| CloudWatch alarms | Per-alarm monthly | ~$0.10/alarm/mo (standard) | Alarm on signals + quotas, not everything |
| Synthetics canaries | Per canary run | ~$0.0012/run | One end-to-end canary per critical journey |

A sizing rule of thumb: the four-golden-signal alarms plus a quota alarm at 80% plus one end-to-end canary per critical user journey is a few hundred rupees a month for most stacks — and it is the difference between a Sev-1 and a Tuesday ticket. The expensive mistake is the opposite: no instrumentation, then a multi-hour outage whose revenue cost dwarfs a year of observability spend. Free Tier covers the lab here entirely; production observability scales with traffic, but the golden-signal subset keeps it bounded.
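To put a number on that rule of thumb, here is a minimal sketch using the rough unit prices from the table above. The 15-minute canary interval and the 30-day month are assumptions, not recommendations, and the calculation ignores Free Tier allowances and Region differences.

```python
# Rough unit prices from the cost table (USD); check current pricing for your Region.
ALARM_PER_MONTH = 0.10   # standard-resolution CloudWatch alarm
CANARY_PER_RUN = 0.0012  # one Synthetics canary run


def canary_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of runs for a canary firing every `interval_minutes`."""
    return (days * 24 * 60) // interval_minutes


def baseline_monthly_usd(alarms: int, canaries: int, canary_interval_minutes: int) -> float:
    """Monthly cost of the minimal alarm-plus-canary baseline."""
    alarm_cost = alarms * ALARM_PER_MONTH
    canary_cost = canaries * canary_runs_per_month(canary_interval_minutes) * CANARY_PER_RUN
    return round(alarm_cost + canary_cost, 2)


# Four golden-signal alarms + one quota alarm + one canary every 15 minutes (assumed).
print(baseline_monthly_usd(alarms=5, canaries=1, canary_interval_minutes=15))
# 3.96
```

The canary interval dominates the figure: the same canary at a 5-minute interval runs three times as often, so it is the first knob to look at when the baseline bill grows.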

The cost-of-an-incident framing, which is what justifies the spend:

| If you skip… | The incident it enables | Rough cost of that incident |
| --- | --- | --- |
| A quota alarm at 80% | A launch-time Sev-1 throttle | Revenue per minute × outage minutes |
| Per-AZ healthy-host alarms | A one-AZ impairment chased blind | Extended MTTR; possible full outage if N+1 absent |
| A DaysToExpiry cert alarm | A site-wide cert expiry | Total outage until a cert is reissued |
| An external DNS/cert canary | An edge failure your dashboards can’t see | Customer-reported outage; reputational hit |
| X-Ray on the critical path | A cascade you can’t localise | Hours of MTTR finding the fault edge |
| A blameless COE + tracked CAPA | The same incident next quarter | Repeat outage at full cost; the worst spend of all |
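The first row's formula, revenue per minute × outage minutes, is easy to set against the observability bill. The sketch below does that arithmetic. Every figure in it is hypothetical, so substitute your own revenue and spend.

```python
def outage_cost(revenue_per_minute: float, outage_minutes: float) -> float:
    """Direct revenue lost during an outage (ignores reputational cost)."""
    return revenue_per_minute * outage_minutes


def years_of_observability(outage_cost_usd: float, monthly_observability_usd: float) -> float:
    """How many years of observability spend one outage would have paid for."""
    return outage_cost_usd / (12 * monthly_observability_usd)


# Hypothetical figures: $200/min of revenue, a 45-minute Sev-1, $50/month of instrumentation.
lost = outage_cost(revenue_per_minute=200, outage_minutes=45)
print(lost, years_of_observability(lost, monthly_observability_usd=50))
# 9000 15.0
```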

Interview & exam questions

1. Walk me through how you run a multi-service incident. Detect → triage (scope, severity, name an IC and comms lead) → communicate on a cadence → mitigate before fully diagnosing → RCA by correlating metrics/logs/traces/CloudTrail on one UTC timeline → prevent via a blameless COE with tracked CAPA items.

2. The symptom is in service A but A didn’t change — how do you find the cause? Use the X-Ray service map to find the failing edge, read the four golden signals for A and its dependencies to classify the failure shape, then pivot to the dependency’s metrics/logs and to CloudTrail for any change at the inflection minute. The cause is usually a downstream throttle, a saturated shared resource, or a config change elsewhere.

3. Errors rise in proportion to traffic and vanish when load drops — what is it, usually? A quota or limit (Lambda concurrency, API request rate, ENIs, etc.), not a bug. Identify it via the throttle metrics and the Service Quotas console; mitigate with a quota increase / load shedding / queueing; prevent by alarming at 80% of quota.

4. One-third of requests fail, two-thirds are fine, persistently. What’s your first hypothesis? Single-AZ impairment on a three-AZ deployment. Confirm with the AWS Health Dashboard (formerly the Personal Health Dashboard), per-AZ load-balancer/target metrics (by AZ ID, since the mapping of AZ names to physical zones differs per account), and the data tier (is an RDS or cache primary in that AZ?). Fail traffic away and rely on cross-AZ headroom.

5. Difference between a mitigation and a fix — give an example. A mitigation stops customer pain now (roll back the deploy, fail over the AZ, raise the quota); a fix removes the cause (correct the access pattern, fix the policy, restore ACM validation). You mitigate first, then fix; both belong in the COE.

6. Multiple services across accounts throw AccessDenied at the same minute — what changed? Almost certainly a policy layer above the app: an SCP (in the management/delegated-admin account), a KMS key policy, or a shared role/permission boundary. CloudTrail (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy, PutRolePolicy) shows the edit; explicit deny wins, so revert the change to mitigate.

7. Your dashboards are green but users say the site won’t load. Where do you look? Outward: DNS, certificates and the edge. Use dig +trace to validate resolution and openssl s_client to check the served certificate’s expiry and subject. Common causes: ACM auto-renewal failed because the DNS validation CNAME was removed, or a Route 53 record or health check flipped.

8. What’s a cascading failure and how do you prevent it? A small dependency hiccup amplified by retries and missing timeouts until threads/connections exhaust and the failure spreads upstream. Prevent with timeouts, exponential backoff with jitter, circuit breakers and bulkheads, plus alarms on dependency throttles and saturation.

9. How do you correlate a metric spike to a specific change? Note the exact UTC minute of the inflection, then run aws cloudtrail lookup-events for a window around it (it covers management events from the last 90 days). A control-plane event in the same minute (a deploy, a quota edit, a policy change) is your likely trigger.

10. Why blameless postmortems? Because blame drives information underground — people stop sharing what really happened, and you fix symptoms instead of the systemic gaps (missing alarm, no timeout, click-ops policy edit) that let a human error become an outage. The COE asks “what about the system allowed this?”, not “who did it?”

11. X-Ray is sampled. What if the failing trace wasn’t captured? Temporarily raise the sampling rate for the affected service, and in the meantime lean on metrics (CloudWatch service metrics are not sampled, and the service-map edge statistics still aggregate across whatever traces were sampled) and on structured logs keyed by request ID.

12. How do you make quota breaches a non-event? Treat quotas as capacity planning: enable Service Quotas usage metrics and Trusted Advisor limit checks, alarm at ~80% of every quota that scales with traffic, and request increases ahead of known peaks.
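Question 8 names exponential backoff with jitter as a defence against cascades. A minimal sketch of the "full jitter" variant follows, where each retry sleeps a random time between zero and an exponentially growing, capped ceiling. `ThrottledError`, the default base and cap, and the `flaky` dependency are all hypothetical stand-ins. Real SDK clients usually ship their own retry configuration, which is normally preferable to hand-rolled loops.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a dependency's throttle or timeout error (hypothetical)."""


def full_jitter_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Delay before retry `attempt` (0-based): random in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call fn, retrying on ThrottledError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(full_jitter_delay(attempt, rng=rng))


# Demo: a dependency that throttles twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("throttled")
    return "ok"


slept = []  # record sleeps instead of actually waiting
# rng is pinned to 1.0 here only to make the demo deterministic.
print(call_with_retries(flaky, sleep=slept.append, rng=lambda: 1.0), slept)
# ok [0.1, 0.2]
```

The bounded attempt count matters as much as the jitter: unbounded retries are exactly the amplification the answer describes, so the final failure is re-raised for a circuit breaker or the caller to handle.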

The certifications these questions map to:

| Question theme | Maps to | Domain |
| --- | --- | --- |
| Lifecycle, IC, mitigate-first, COE | SOA-C02 | Monitoring, Logging & Remediation |
| Golden signals, correlation, X-Ray map | SOA-C02 / DOP-C02 | Monitoring & Logging |
| Quotas, AZ design, blast radius | SAP-C02 | Continuous improvement / org complexity |
| Blameless postmortem + CAPA | DOP-C02 | Incident & Event Response |

Quick check

  1. What must you align across CloudWatch, CloudTrail, X-Ray and logs before you can correlate them?
  2. In the incident lifecycle, what do you do before you fully understand the cause?
  3. “Errors rise with traffic, recover when traffic drops” — what class of problem is this?
  4. Which AWS construct, changed in one account, can deny actions across an entire Organization regardless of IAM?
  5. Your application metrics are healthy but users can’t reach the site — name two things to check.

Answers

  1. A single UTC timeline (and ideally a shared request/trace ID) — same clock and window across every tool.
  2. Mitigate — roll back, fail over, raise a quota, shed load — to stop customer impact; diagnose afterwards from the telemetry.
  3. A service-quota / limit breach (e.g. Lambda concurrency, API throttling), not an application bug.
  4. A Service Control Policy (SCP) — an explicit deny in an SCP overrides any IAM allow across the OU/Organization.
  5. DNS (dig +trace — wrong/empty answer, failover/health-check flip, deleted record) and the TLS certificate (openssl s_client — expired or wrong subject, usually ACM DNS-validation lapse). Edge/WAF is a valid third.
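Answer 1's "single UTC timeline" can be made concrete with a few lines of code. The sketch below normalises timestamps from different tools to UTC, sorts them, and picks out everything near the inflection minute. The events are invented, and treating timezone-less timestamps as UTC is an assumption you should confirm for each log source.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalise it to UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # ASSUMPTION: timestamps without an offset are already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def build_timeline(events):
    """events: iterable of (source, timestamp, description); returns them oldest first."""
    return sorted(((to_utc(ts), source, what) for source, ts, what in events),
                  key=lambda e: e[0])


def around(timeline, inflection: str, minutes: int = 5):
    """Entries within +/- `minutes` of the inflection point."""
    centre = to_utc(inflection)
    window = timedelta(minutes=minutes)
    return [e for e in timeline if abs(e[0] - centre) <= window]


# Hypothetical events from three tools, each with a different timestamp style.
events = [
    ("cloudwatch", "2024-05-01T07:44:00+05:30", "5xx alarm fired"),
    ("cloudtrail", "2024-05-01T02:13:40Z", "PutKeyPolicy"),
    ("app-log", "2024-05-01 02:14:05", "AccessDenied on Decrypt"),
    ("cloudtrail", "2024-05-01T01:30:00Z", "unrelated deploy"),
]
timeline = build_timeline(events)
for when, source, what in around(timeline, "2024-05-01T02:14:00Z"):
    print(when.isoformat(), source, what)
```

In this made-up example the console-local alarm time (07:44 at +05:30) lands at 02:14 UTC, twenty seconds after the key-policy edit, which is the kind of adjacency that is invisible until every source is on one clock.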

Glossary

Next steps

You now have the operational mindset for incidents that span services. Turn that operational knowledge into design with The AWS Architecting Ladder: From a Static Site to Multi-Region Active-Active, which shows how the resilience primitives you just used in mitigation — Multi-AZ, failover, headroom, bulkheads — are baked into architectures from the ground up, so the incidents in this lesson become non-events. Deepen the instrument plane with CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing, the two tools you triangulate with most. Revisit the per-layer detail in AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda, and when you are ready to certify, the AWS Certification Prep Kit (CLF, SAA, SOA, DVA, SAP, DOP) maps this lesson to the exact SOA-C02 and SAP-C02 domains it covers.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
