A single-service problem is a puzzle; a multi-service incident is a crime scene. When an EC2 instance will not accept SSH you can reason about it in isolation — security group, key, subnet, status checks — and the previous lesson gave you those playbooks. But production rarely fails that politely. Real incidents arrive as a payments API at 30% error rate, and the cause turns out to be a DynamoDB table that started throttling because a deploy three hours ago doubled its read pattern, which exhausted the connection pool in a downstream Lambda, which tripped a circuit breaker in the upstream service, which is why the checkout page — three hops away — is timing out. Nobody changed checkout. The symptom and the cause are in different services, owned by different teams, and the only thing that ties them together is a trace ID and a timeline.
This lesson is about that kind of incident. The skill is no longer “do I know this service” — it is correlation under pressure: imposing a lifecycle on the chaos, reading three or four observability tools against one clock, forming hypotheses you can cheaply disprove, and writing the whole thing down afterwards so it never recurs. We will walk the incident-response lifecycle, the correlation toolkit (CloudWatch metrics and Logs Insights, CloudTrail, X-Ray, the Health Dashboard, Trusted Advisor), five fully worked complex scenarios as symptom → hypotheses → cross-service diagnosis → fix, the role of Service Quotas, and the blameless postmortem (Amazon’s COE) that closes the loop. This is the SOA-C02 and SAP-C02 operational-excellence mindset, written the way an on-call architect actually works it.
Because this is a reference you will return to mid-incident, the playbook itself, the error codes, the golden-signal shapes, the quota ceilings and the scenario fingerprints are all laid out as scannable tables — read the prose once, then keep the tables open at 02:14. By the end you will stop staring at the failing service. You will localise a symptom to a hop on the request path, classify the failure shape in two minutes, and walk to the cause even when it lives in a service you did not touch.
What problem this solves
In a monolith, the failing function and the failing user are in the same process; a stack trace points at the line. In a distributed system the request fans across CloudFront, an ALB, several compute tiers, a queue, a database and three AWS-managed control planes — and the failure you see is almost never co-located with the failure that caused it. The 5xx surfaces in checkout; the throttle is on one DynamoDB partition; the trigger is a feature flag on a third service. Without a method, three engineers stare at the checkout dashboard while nobody talks to the customer, nobody mitigates, and nobody writes down what actually happened.
What breaks without this discipline: incidents run long because investigation has tunnel vision, mitigations are skipped in a rush to “understand it first,” signals get compared across mismatched clocks so people prove the wrong thing, and the postmortem names a person instead of the systemic gap — so the same outage recurs next quarter. The cost is measured in revenue per minute of customer impact and in the slow erosion of trust when the team cannot say why the last outage happened.
Who hits this: anyone running a real distributed system on AWS — SREs, on-call engineers, platform teams, and the architects who design the blast-radius. It bites hardest on teams without instrumentation for correlation (no request/trace IDs in logs, X-Ray off the critical path, no quota alarms), on multi-account organisations where an SCP edit in the management account can deny actions everywhere, and on cost-sensitive stacks that run a single AZ or a too-low Lambda reserved-concurrency. The fix is almost never “add more compute” — it is “impose a lifecycle, read four tools on one clock, and find the hop that is lying.”
To frame the whole field before the deep dive, here is every complex-incident class this lesson covers, the golden-signal shape that fingerprints it, the first tool to open, and the single most common root cause:
| Incident class | Golden-signal shape | First question | First tool to open | Most common single cause |
|---|---|---|---|---|
| Cascading throttle (S1) | Latency up first, then errors, traffic flat | Is a downstream slow or rejecting? | X-Ray service map | A flag flipped Query→Scan, hot-partition throttle |
| AZ impairment (S2) | ~1/3 errors, 2/3 fine, persistent | One AZ or one bad host? | ALB metrics BY AZ + PHD | A single Availability Zone degraded |
| Quota / limit (S3) | Errors track traffic, vanish at rest | Is this a bug or a ceiling? | Service Quotas + Throttle metrics | Concurrency / API-rate / ENI quota hit |
| IAM / SCP blast radius (S4) | AccessDenied across many services, one minute | Which policy layer denies? | CloudTrail (mgmt account) | An SCP/KMS/role edit with a wide reach |
| DNS / cert (S5) | Dashboards green, users can’t load | Is the failure upstream of me? | dig +trace / openssl s_client | ACM renewal lapsed (validation CNAME gone) |
Learning objectives
By the end of this lesson you will be able to:
- Run an incident through a disciplined lifecycle — detect, triage, communicate, mitigate, perform root-cause analysis (RCA), prevent — and know what “done” means at each stage.
- Correlate CloudWatch metrics, CloudWatch Logs Insights queries, CloudTrail events and X-Ray traces against a single UTC timeline to locate a cause that is not in the failing service.
- Read the four golden signals (latency, traffic, errors, saturation) as shapes that classify a failure before you open a single log line.
- Diagnose and fix five classes of complex incident: cascading failure from a throttled dependency, Availability Zone (AZ) impairment, a service-quota breach under load, an IAM/SCP change with a wide blast radius, and a Route 53/DNS or ACM certificate failure.
- Tell the difference between a mitigation (stop the bleeding now) and a fix (remove the cause) and sequence them correctly.
- Use Service Quotas, Trusted Advisor and the AWS Health Dashboard proactively so the next breach is a CloudWatch alarm, not an outage.
- Write a blameless Correction of Error (COE) with a real five-whys, contributing factors and CAPA action items.
Prerequisites & where this fits
You should be comfortable with the single-service troubleshooting method and playbooks from AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda — reproduce, isolate the layer, check config against desired state, inspect CloudWatch and CloudTrail, hypothesise, fix, verify, prevent. You should know IAM policy evaluation (explicit deny beats allow beats implicit deny) from IAM Fundamentals: Users, Roles, Policies & Evaluation, how VPC routing and security groups work, and roughly what CloudWatch, CloudTrail and X-Ray each record — the depth is in CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing.
This lesson sits in the Troubleshooting & operations module of the Zero-to-Hero track, immediately after the single-service playbooks and before the architecting ladder. It is mapped to SOA-C02 (AWS Certified SysOps Administrator) and the operational sections of SAP-C02 (Solutions Architect Professional). The assumed knowledge, and where to brush it up if it’s rusty:
| You should know… | Why it matters here | Brush up with |
|---|---|---|
| Single-service troubleshooting method | Cross-service RCA builds on per-layer isolation | Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda |
| IAM policy evaluation (explicit deny wins) | Scenario 4 walks the evaluation chain | IAM Fundamentals: Users, Roles, Policies & Evaluation |
| What CloudWatch / CloudTrail record | They are two of your four correlation tools | CloudWatch & CloudTrail Observability Deep Dive |
| Distributed tracing basics | The X-Ray service map finds the fault edge | X-Ray: Service Map, Segments & ADOT Tracing |
| VPC routing, security groups, AZs | AZ impairment and connectivity reasoning | VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints |
A quick map of who confirms what during a cross-service incident, so you page the right owner fast:
| Layer / plane | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Edge / DNS (Route 53, CloudFront, ACM) | Resolution, TLS, edge routing | Frontend / SRE | DNS misroute, expired cert (S5) — dashboards stay green |
| Ingress (ALB/NLB, WAF) | L7 routing, health, AZ spread | Network team | Per-AZ 5xx (S2), WAF 403, listener/cert errors |
| Compute (ECS/EKS, Lambda, EC2) | Your services, pools, concurrency | App / platform team | Cascades (S1), concurrency throttles (S3), crashes |
| Data (DynamoDB, RDS, ElastiCache) | Stateful tier, capacity | Data / platform team | Throttles (S1), AZ-pinned primary (S2), saturation |
| Identity / policy (IAM, SCP, KMS) | Permissions above the app | Security / cloud platform | Wide AccessDenied blast radius (S4) |
| Control / correlation (CloudWatch, CloudTrail, X-Ray) | The evidence plane | SRE / everyone | Not a cause — your instrument; blind spots mislead |
Core concepts
Five mental models make every later diagnosis obvious.
The symptom and the cause live in different services. The defining property of a multi-service incident is non-locality: the 5xx is in checkout, the throttle is on a DynamoDB partition, the trigger is a flag on a third service. So you never trust the failing service as the suspect. You start from the request path, find the failing edge (X-Ray), and walk toward the cause. “Where it hurts” and “what’s wrong” are different questions.
You read four tools against one clock. CloudWatch metrics tell you what changed and when; Logs Insights tells you what happened in a service; CloudTrail tells you who touched the control plane; X-Ray tells you where in the call graph the time and errors go. None of them is sufficient alone, and they only correlate if you put every console and query in UTC with the same time window. The single most common diagnostic error is comparing a log in local time against a metric in UTC and “proving” the wrong thing.
The golden-signal shape classifies the failure before you read a log. Latency, traffic, errors and saturation are the four golden signals, and their relative movement is a fingerprint. Errors up, latency flat, traffic flat is a fast-rejecting dependency. Latency up first, then errors, traffic flat is saturation or a slow dependency. Traffic up, then latency/errors up is load hitting a ceiling. A step change at an exact timestamp means something changed — go to CloudTrail for that minute. Reading the shape buys you the right hypothesis in the first two minutes.
Mitigation and fix are different jobs done in a fixed order. A mitigation restores service now (roll back, fail over, raise a quota, shed load, flip a flag); a fix removes the cause (correct the access pattern, repair the policy, restore ACM validation). During customer impact you mitigate first — the telemetry keeps the evidence — then diagnose and fix from a calm seat. Trying to find root cause before acting is the classic junior mistake that turns a five-minute incident into a two-hour one.
Most incidents are preventable hygiene, not fate. Three of the five scenarios below (quota, AZ, DNS/cert) are converted from outages into Tuesday-afternoon tickets by present-day alarms: a quota alarm at 80%, a per-AZ healthy-host alarm, a DaysToExpiry alarm, a CloudTrail alarm on sensitive policy events. “Convert future incidents into present-day alarms” is the whole prevention thesis.
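The 80% quota alarm reduces to a ratio against the applied quota value. The sketch below is a minimal illustration of that check, not a production alarm. The function names, the 0.8 default and the sample readings are assumptions; in practice the usage figure would come from the usage metric for that quota.

```python
def quota_utilisation(usage: float, applied_limit: float) -> float:
    """Return usage as a fraction of the applied quota value."""
    if applied_limit <= 0:
        raise ValueError("applied_limit must be positive")
    return usage / applied_limit


def should_alarm(usage: float, applied_limit: float, threshold: float = 0.8) -> bool:
    """True once usage reaches the alarm threshold (80% by default)."""
    return quota_utilisation(usage, applied_limit) >= threshold


if __name__ == "__main__":
    # Hypothetical readings against a limit of 1,000 concurrent executions.
    for usage in (400, 850):
        print(usage, should_alarm(usage, 1000))
```

Alarming on a fraction rather than an absolute count means the same rule keeps working after a quota increase is granted.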
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters to multi-service RCA |
|---|---|---|---|
| Incident Commander (IC) | The single coordinator; decides, assigns | Org / on-call rotation | One brain prevents three-theory thrash |
| Mitigation | Restore service now (rollback/failover/quota) | The incident bridge | Stops the bleeding before you understand it |
| Fix | Remove the underlying cause | After mitigation | The thing that stops recurrence |
| RCA | Disciplined search for the systemic cause | Post-mitigation | Finds cause, not just trigger |
| Golden signals | Latency, traffic, errors, saturation | CloudWatch | Their shape classifies the failure |
| Correlation | Lining up signals on one UTC clock | All four tools | The core multi-service move |
| Blast radius | Breadth of impact of one change | IAM/SCP/KMS/DNS | Widest for policy edits |
| Cascading failure | Small hiccup amplified by retries | Service-to-service edges | Turns a blip into an outage |
| Service Quota | An account/Region limit (rate, count) | Service Quotas console | A capacity input, not an incident — if monitored |
| AZ impairment | A fault confined to one AZ | One Availability Zone | Survivable with cross-AZ headroom |
| COE | Amazon’s blameless postmortem format | After the incident | Closes the loop with CAPA items |
| CAPA | Owned, dated corrective/preventive actions | Tracker | An action without an owner is a wish |
The incident-response lifecycle
Under pressure, ad-hoc investigation produces tunnel vision: three engineers staring at the same dashboard while nobody talks to the customer or writes anything down. A lifecycle is the antidote. It is not bureaucracy — it is the set of jobs that must happen concurrently, and naming them lets you assign them.
| Phase | Goal | Key actions | Done when |
|---|---|---|---|
| Detect | Know there is an incident, fast | Alarms (CloudWatch composite + SLO burn-rate), synthetic canaries, customer reports, Health Dashboard events | An incident is declared with a severity and a single owner |
| Triage | Size the blast radius and assign roles | Confirm scope (one customer / one AZ / one Region / global), set severity, name an Incident Commander (IC), a comms lead and an ops lead | Everyone knows their role and the severity is agreed |
| Communicate | Keep stakeholders ahead of the rumour | Status page / internal channel updates on a cadence (e.g. every 30 min for Sev-1), business-impact statement | Stakeholders are updated and the cadence is running |
| Mitigate | Stop customer pain before you understand it fully | Roll back, fail over, raise a quota, shed load, disable a feature flag — anything that restores service | Customer impact is gone or sharply reduced |
| RCA | Find the true cause, not the trigger | Correlate logs/metrics/traces/CloudTrail on one timeline, five-whys | The causal chain is written down and agreed |
| Prevent | Make recurrence impossible or detectable | CAPA action items with owners and dates, new alarms, guardrails, runbook updates | Action items are tracked to completion |
Two principles separate seniors from juniors here. First: mitigate before you fully diagnose. The instinct to find the cause before acting is wrong during customer impact — if a rollback or failover restores service, do it, then investigate from a calm seat. The cause is still in the telemetry. Second: one Incident Commander. The IC does not debug; they coordinate, decide, and protect the responders from interruptions. The fastest incidents I have run had an IC who never touched a keyboard.
The three incident roles, and the trap when one person tries to wear two of them:
| Role | Owns | Explicitly does NOT | Failure mode if merged |
|---|---|---|---|
| Incident Commander | Decisions, role assignment, severity, the bridge | Debug, type commands | If the IC debugs, nobody coordinates → tunnel vision |
| Comms lead | Status page, exec/stakeholder updates on cadence | Diagnose, decide mitigations | If merged with IC, comms slip while debugging |
| Ops lead (responder) | Run the diagnosis and mitigations | Decide scope/severity, talk to execs | If responders self-coordinate, three theories, no progress |
A note on severity: pick a simple scale and stick to it. Severity drives the comms cadence and who gets paged — it is a routing decision, not a judgement of blame.
| Severity | Definition | Page / staffing | Comms cadence | Example |
|---|---|---|---|---|
| Sev-1 | Major customer-facing outage | All hands, IC + exec comms | Every 30 min | Checkout down org-wide |
| Sev-2 | Significant degradation or single-AZ/feature impact | On-call + IC | Every 60 min | One AZ impaired, N+1 absorbs it |
| Sev-3 | Minor or internal-only impact | On-call engineer | At resolution | A non-critical dashboard stale |
| Sev-4 | Cosmetic / no customer impact | Backlog ticket | None | A typo in an internal runbook |
The mitigate-then-fix sequencing is the move juniors get wrong most. Mitigations are fast and reversible and stop pain; fixes are slower and remove the cause. The same incident usually has both, and you do them in that order:
| Incident class | Mitigation (now, reversible) | Fix (later, removes cause) | Why the order matters |
|---|---|---|---|
| Throttle cascade (S1) | Flip the flag off; table → on-demand | Restore Query/GSI; add circuit breaker | Flag-off stops bleeding in seconds; code fix takes a sprint |
| AZ impairment (S2) | Fail traffic away from the AZ | Add cross-AZ headroom; test failover | You can’t “fix” an AZ — AWS does; you fix your tolerance |
| Quota breach (S3) | Request increase; shed/queue load | Right-size reservations; alarm at 80% | Increase is minutes; capacity planning is ongoing |
| IAM/SCP (S4) | Revert the policy edit | Re-apply the intent narrowly via pipeline | Revert restores access now; correct scoping needs review |
| DNS / cert (S5) | Revert record; deploy a valid cert | Restore ACM auto-renew; records-as-code | Cert/record swap is immediate; renewal hygiene is durable |
| Deploy regression | Roll back to the last good version | Fix forward and re-deploy through CI | Rollback is the fastest path to green |
The correlation toolkit: reading four tools against one clock
The defining move of multi-service RCA is correlation. The cause lives in a different service from the symptom, so you must line up signals from several tools on a single, UTC timeline. Configure every console and query to UTC during an incident — the single most common diagnostic error is comparing a log in local time against a metric in UTC and “proving” the wrong thing.
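To make the one-clock rule concrete, here is a small sketch that normalises timestamps from different tools to UTC and merges them into a single timeline. The event tuples, source labels and the +05:30 offset are hypothetical; the point is only that a timestamp without an explicit offset is rejected rather than guessed at.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO-8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # A naive timestamp is exactly the ambiguity that misleads an RCA.
        raise ValueError(f"timestamp has no UTC offset: {stamp!r}")
    return parsed.astimezone(timezone.utc)


def merge_timeline(events):
    """Sort (timestamp, source, note) tuples from any tool onto one UTC clock."""
    normalised = [(to_utc(ts), source, note) for ts, source, note in events]
    return sorted(normalised, key=lambda item: item[0])


if __name__ == "__main__":
    # Hypothetical events; the app log was written in a +05:30 local zone.
    events = [
        ("2026-06-15T14:31:00Z", "cloudwatch", "p99 latency starts climbing"),
        ("2026-06-15T19:58:00+05:30", "app-log", "first throttle error logged"),
        ("2026-06-15T14:27:10Z", "cloudtrail", "feature flag updated"),
    ]
    for when, source, note in merge_timeline(events):
        print(when.isoformat(), source, note)
```

Read naively, the 19:58 log line looks like it happened hours after the metric moved; converted, it sits at 14:28 UTC, between the control-plane change and the latency climb.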
Here is what each tool is for, and crucially what it is not for:
| Tool | Answers | Best for | Blind spot |
|---|---|---|---|
| CloudWatch metrics | “What changed, and when?” | Spotting the inflection point — latency, errors, throttles, saturation | Aggregates hide per-request detail; high cardinality is expensive |
| CloudWatch Logs Insights | “What exactly happened in this service?” | Querying structured logs across log groups for patterns, error messages, request IDs | Only as good as your logging; cross-service joins are manual |
| CloudTrail | “Who did what to the control plane, and when?” | Tying an incident to a config change, deploy, IAM/SCP edit, quota change | Management events only by default; data events cost extra; ~minutes of delivery latency |
| X-Ray (traces) | “Where in the call graph is the time/error going?” | Following one request across services; the service map shows the failing edge | Sampled by default — the one trace you want may not be captured |
| AWS Health Dashboard | “Is this AWS’s problem?” | Ruling in/out an AWS-side event in your account/Region (PHD shows account-specific) | Not real-time to the second; absence of an event is not proof |
| Trusted Advisor | “Am I near a limit or misconfigured?” | Pre-incident hygiene and quota headroom checks | Periodic, not a live diagnostic |
Before the shapes, the concrete metrics you actually overlay — which service, which metric, which golden signal it represents, and the number that means trouble:
| Service | Metric | Golden signal | “Bad” reading | Pairs with |
|---|---|---|---|---|
| ALB | TargetResponseTime | Latency | p99 climbing toward the LB timeout | HTTPCode_Target_5XX_Count |
| ALB | HTTPCode_ELB_5XX_Count | Errors | Non-zero (LB itself, not target) | RequestCount, target health |
| ALB | HealthyHostCount (by AZ) | Saturation | Drops in one AZ | per-AZ 5XX concentration |
| Lambda | Throttles | Errors | Rising while ConcurrentExecutions flat | Invocations |
| Lambda | ConcurrentExecutions | Saturation | Flatlines at 1,000 / reserved | Throttles |
| DynamoDB | ReadThrottleEvents / WriteThrottleEvents | Errors | Any non-zero | ConsumedReadCapacityUnits |
| DynamoDB | SuccessfulRequestLatency | Latency | Spiking on one operation | throttle events |
| RDS | DatabaseConnections | Saturation | Near max_connections | CPUUtilization, ReadLatency |
| ECS / EKS | CPUUtilization / pool depth | Saturation | Pool exhausted, CPU pinned | upstream latency |
| Service Quotas | AWS/Usage (per quota) | Saturation | > 80% of applied value | the throttle metric it gates |
| API Gateway | 5XXError / Latency | Errors / Latency | Spiking on one stage/route | integration (Lambda) errors |
| SQS | ApproximateAgeOfOldestMessage | Saturation | Climbing: consumer not draining | consumer Throttles/errors |
The high-leverage skill is the golden-signal sweep: in the first two minutes, look at latency, traffic, errors and saturation (the four golden signals) for the affected service and its immediate dependencies, all on the same time window. The shape tells you the class of problem before you read a single log line. Here is the full shape-to-class map you scan first:
| Latency | Traffic | Errors | Saturation | It’s probably… | Go straight to |
|---|---|---|---|---|---|
| Flat | Flat | Up | Flat | A dependency rejecting fast (throttle/4xx/bad deploy) | The dependency’s throttle metrics |
| Up first | Flat | Up after | Rising | Saturation or a slow dependency; queue fills, then timeouts | X-Ray service map, then pool/queue depth |
| Up | Up | Up | Rising | Load-driven — a capacity or quota ceiling | Service Quotas + concurrency/throttle metrics |
| Step | Flat | Step | Flat | Something changed at an exact minute | CloudTrail lookup-events for that minute |
| Flat | Flat | Up (one third) | Flat in 2 AZs | Single-AZ impairment | ALB metrics BY AZ-ID + PHD |
| Flat | Flat | Flat (your view) | Flat | Failure is upstream (DNS/cert/edge) | dig +trace, openssl s_client |
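The shape-to-class map above is essentially a lookup, and writing it as one makes the two-minute sweep mechanical. This is a sketch only: the string labels for each signal ("up_first", "up_one_third" and so on) are an assumed encoding, and deciding which label a real graph deserves is still a human judgement.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    latency: str     # "flat", "up", "up_first" or "step"
    traffic: str     # "flat" or "up"
    errors: str      # "flat", "up", "up_after", "step" or "up_one_third"
    saturation: str  # "flat" or "rising"


# Rows transcribed from the shape-to-class table; labels are illustrative.
SHAPE_MAP = {
    Shape("flat", "flat", "up", "flat"): "dependency rejecting fast",
    Shape("up_first", "flat", "up_after", "rising"): "saturation or slow dependency",
    Shape("up", "up", "up", "rising"): "load hitting a capacity or quota ceiling",
    Shape("step", "flat", "step", "flat"): "something changed at an exact minute",
    Shape("flat", "flat", "up_one_third", "flat"): "single-AZ impairment",
    Shape("flat", "flat", "flat", "flat"): "failure is upstream (DNS/cert/edge)",
}


def classify(shape: Shape) -> str:
    """Return the probable incident class, or a prompt to widen the sweep."""
    return SHAPE_MAP.get(shape, "no match: sweep the immediate dependencies")


if __name__ == "__main__":
    observed = Shape("up_first", "flat", "up_after", "rising")
    print(classify(observed))
```

An unmatched shape is useful information in itself: it usually means the sweep covered only the failing service and not the hops behind it.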
A reading note that saves the most time: who emitted an error decides which logs you open.
| Question | The trap | How to tell |
|---|---|---|
| Did my service return the 5xx, or a layer in front? | An hour in the wrong logs | If X-Ray/logs show the request succeeding (slowly) but the client got 502, the ALB/CloudFront emitted it on timeout |
| Is the throttle in my app or in AWS? | “Scale up” masks it | An AWS throttle names the service in the error (ThrottlingException, ProvisionedThroughputExceeded); an app limit does not |
| Did a human change something, or did infra fail? | Chasing a config ghost | CloudTrail quiet at the inflection minute is a strong signal the cause is infrastructure (AZ), not a change |
Logs Insights and CloudTrail: the two queries you will type most
When errors spike, this Logs Insights query finds the dominant failure mode fast: bin by minute, count by error type, and you see both the onset and the breakdown. It assumes structured logs that expose status and errorType fields:
```
fields @timestamp, @message
| filter status >= 500 or @message like /Throttl|Timeout|ProvisionedThroughputExceeded/
| stats count(*) as errors by bin(1m), errorType
| sort errors desc
```
And when a metric shows a step change at a specific minute, this CloudTrail lookup answers “what changed”:
```bash
# Who/what touched the control plane around the inflection point (UTC)
aws cloudtrail lookup-events \
  --start-time 2026-06-15T14:25:00Z \
  --end-time 2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName,Src:EventSource}' \
  --output table
```
Pair a step change in a metric with a CloudTrail event in the same minute and you have usually found your trigger. For request-level correlation, propagate a request ID (and the X-Ray trace ID) through your logs so you can pivot from a metric, to the traces on that edge, to the exact log lines for that request. The exact CLI / console moves for each correlation step:
| Step | Tool | Exact command / path | What it proves |
|---|---|---|---|
| Find the inflection minute | CloudWatch | Metrics → set timezone UTC → note the step | When it started |
| See the failing edge | X-Ray | Service map → click the red edge → trace list | Where in the call graph |
| Break down the errors | Logs Insights | the stats count(*) by bin(1m), errorType query | The dominant failure mode |
| Tie to a change | CloudTrail | aws cloudtrail lookup-events --start-time … --end-time … | Who/what changed at that minute |
| Confirm AWS-side | Health Dashboard | PHD → events for your account/Region | Whether it’s AWS, not you |
| Pivot to one request | Logs + X-Ray | filter logs by reqId / trace_id | The exact request’s path and lines |
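The last pivot, from a trace ID to the exact log lines, is easy to sketch offline once logs have been exported. The example below groups JSON log lines from several services by trace ID and orders each group by time. The field names trace_id, ts and service are assumptions about the log schema, not something every logger emits.

```python
import json
from collections import defaultdict


def group_by_trace(log_lines):
    """Group JSON log lines from any service by their trace_id field."""
    by_trace = defaultdict(list)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines rather than fail mid-incident
        if isinstance(record, dict) and record.get("trace_id"):
            by_trace[record["trace_id"]].append(record)
    for records in by_trace.values():
        # ISO-8601 UTC strings of the same format sort correctly as text.
        records.sort(key=lambda r: r.get("ts", ""))
    return dict(by_trace)


if __name__ == "__main__":
    # Hypothetical lines exported from two services' log groups.
    lines = [
        '{"ts": "2026-06-15T14:28:03Z", "service": "order-service", "trace_id": "1-abc", "msg": "throttled"}',
        '{"ts": "2026-06-15T14:28:01Z", "service": "checkout", "trace_id": "1-abc", "msg": "request start"}',
        'plain text line without structure',
        '{"ts": "2026-06-15T14:28:09Z", "service": "checkout", "trace_id": "1-abc", "msg": "timeout"}',
    ]
    for record in group_by_trace(lines)["1-abc"]:
        print(record["ts"], record["service"], record["msg"])
```

This only works if every service logs the same ID, which is why the ID has to be propagated on the request path before the incident.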
The error & throttle reference
Before the worked scenarios, here is the lookup table you scan when an error string appears: the codes and exceptions you realistically see across services during a cross-service incident, what each means, the likely cause, how to confirm it, and the first fix. The non-obvious ones are the throttle exceptions (they name the service and scale with load) and the difference between a 502 the ALB emitted and a 5xx your app emitted.
| Code / exception | Where it appears | Likely cause | How to confirm | First fix |
|---|---|---|---|---|
| ProvisionedThroughputExceededException | DynamoDB SDK / logs | Read/write past table or partition capacity | ReadThrottleEvents/WriteThrottleEvents non-zero; hot partition | On-demand or raise capacity; fix the access pattern |
| ThrottlingException / Rate exceeded | Many AWS APIs | Account/API request-rate quota hit under load | Service Quotas usage metric vs applied value | Backoff+jitter; request quota increase |
| TooManyRequestsException (Lambda 429) | Lambda invoke | Concurrency limit (account 1,000 or reserved) | Throttles up while ConcurrentExecutions flat | Raise reserved concurrency; queue with SQS |
| 502 Bad Gateway | ALB / CloudFront | Target gave no/bad answer, or upstream timeout | App trace shows request succeeding slowly → LB emitted it | Speed up target; raise idle/keep-alive; fix health |
| 503 Service Unavailable | ALB | No healthy target in the target group | Target group HealthyHostCount = 0 (maybe one AZ) | Restore healthy targets; check per-AZ |
| 504 Gateway Timeout | ALB / API Gateway | Backend slower than the LB/integration timeout | Backend p99 climbing toward the timeout value | Speed up backend; raise timeout to match |
| AccessDenied | S3/KMS/STS/most APIs | IAM/SCP/resource-policy/KMS-key-policy deny | CloudTrail errorCode: AccessDenied; trace the eval chain | Revert the offending policy edit |
| AccessDeniedException (KMS) | KMS-backed ops | Key policy/grant removed Decrypt/GenerateDataKey | CloudTrail PutKeyPolicy/RevokeGrant | Restore the key-policy statement/grant |
| SERVFAIL / empty answer | dig output | DNS broken: bad record, delegation, health-check flip | dig +trace, dig @8.8.8.8 | Revert ChangeResourceRecordSets; fix NS |
| ERR_CERT_DATE_INVALID | Browser / client | Expired or wrong cert served at the edge | openssl s_client notAfter/subject | Deploy valid cert; restore ACM validation |
| 5xx from your app | App logs / X-Ray | A real runtime exception in a handler | Logs Insights filter status >= 500; stack trace | Fix the throwing code path |
| ResourceLimitExceeded / LimitExceeded | EC2/ENI/EIP APIs | A resource-count quota (ENIs, EIPs, instances) | Service Quotas applied value; Trusted Advisor limits | Request increase; right-size; clean up leaks |
| RequestLimitExceeded | EC2 control plane | API call-rate throttling (describe storms) | CloudTrail volume; SDK retry logs | Backoff+jitter; cache describes; spread calls |
| InternalError / 5xx | Any AWS API | Transient AWS-side error | PHD; retry succeeds | Retry with backoff; it’s usually self-healing |
| ConditionalCheckFailedException | DynamoDB | Optimistic-lock condition not met (often expected) | Is it in normal write paths? | Usually benign; only chase if it spikes |
Three reading notes that save the most time:
| Distinction | The trap | How to tell them apart |
|---|---|---|
| LB-emitted 502 vs app-emitted 5xx | Hours wasted in the wrong logs | If X-Ray shows the request succeeding (slowly) but the client got 502, the load balancer emitted it on timeout |
| Throttle vs application bug | “Scale up” masks the real ceiling | A throttle names the service and tracks traffic (vanishes at rest); a bug does not scale with load |
| AccessDenied from IAM vs SCP/KMS | Editing the wrong policy layer | Many services denied at one minute points above the app: SCP (mgmt account) or KMS key policy, not per-service IAM |
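The throttle-versus-bug distinction can be checked roughly from two per-minute series. The heuristic below compares the error rate in busy minutes with the rate in quiet minutes. It is a sketch under stated assumptions: splitting at the median and the 3x ratio are arbitrary illustrative choices, and the sample series are invented.

```python
from statistics import median


def _error_rate(pairs):
    """Errors divided by requests for a list of (traffic, errors) pairs."""
    requests = sum(t for t, _ in pairs)
    return sum(e for _, e in pairs) / requests if requests else 0.0


def errors_track_traffic(traffic, errors, ratio=3.0):
    """Heuristic: is the error rate much higher in busy minutes than quiet ones?"""
    if len(traffic) != len(errors) or not traffic:
        raise ValueError("need two equal-length, non-empty series")
    cut = median(traffic)
    pairs = list(zip(traffic, errors))
    busy_rate = _error_rate([p for p in pairs if p[0] > cut])
    quiet_rate = _error_rate([p for p in pairs if p[0] <= cut])
    if busy_rate == 0:
        return False
    # A ceiling shows errors only, or mostly, when traffic is high.
    return quiet_rate == 0 or busy_rate / quiet_rate >= ratio


if __name__ == "__main__":
    traffic = [100, 120, 900, 950, 110, 1000]   # hypothetical requests per minute
    ceiling = [0, 0, 45, 60, 0, 80]             # errors vanish at rest
    bug = [5, 6, 45, 47, 5, 50]                 # roughly 5% at any load
    print(errors_track_traffic(traffic, ceiling))
    print(errors_track_traffic(traffic, bug))
```

A constant error rate at every traffic level argues for a code path that throws; a rate that appears only under load argues for a ceiling worth looking up in Service Quotas.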
The five worked scenarios that follow are the five complex-incident classes you will actually meet. Before the detail, here they are side by side on the dimensions that distinguish them — so mid-incident you can match the fingerprint to the scenario and jump to the right playbook:
| Scenario | Distinguishing fingerprint | Confirming tool | CloudTrail at the minute | Mitigation | Durable fix |
|---|---|---|---|---|---|
| S1: Throttle cascade | Latency up first, then 5xx, traffic flat | X-Ray red edge + ReadThrottleEvents | A deploy/flag on a third service | Kill the flag; on-demand; backoff | Restore Query/GSI; circuit breaker; bulkheads |
| S2: AZ impairment | ~1/3 fail, 2/3 fine, persistent | ALB HealthyHostCount by AZ + PHD | Quiet (infra, not a change) | Fail away from the AZ | Cross-AZ N+1; tested Multi-AZ failover |
| S3: Quota / limit | Errors track traffic, vanish at rest | Service Quotas + Throttles | Maybe a lowered reservation | Raise quota; queue; shed load | Right-size concurrency; alarm at 80% |
| S4: IAM/SCP blast radius | Many services AccessDenied, one minute | CloudTrail in mgmt account | The policy edit, exactly | Revert the policy change | Policy-as-code with review pipeline |
| S5: DNS / cert | Dashboards green, users can’t load | dig +trace / openssl s_client | ChangeResourceRecordSets or none | Revert record / deploy valid cert | ACM auto-renew; record-as-code; canary |
Worked scenario 1 — Cascading failure from a throttled dependency
Symptom. The checkout API’s p99 latency climbs from 200 ms to 8 s over ten minutes, then it starts returning 5xx. Traffic is flat — no marketing spike. The on-call for checkout swears they deployed nothing.
Hypotheses.
- Checkout itself regressed (deploy, memory leak, GC). Unlikely — no deploy, and latency rose before errors.
- A downstream dependency slowed or started rejecting, and checkout’s threads/connections are now blocked waiting on it (back-pressure cascade).
- A shared resource (a database, a connection pool, a Lambda concurrency limit) is saturated and several services are contending for it.
Cross-service diagnosis. The golden-signal shape (latency up first, then errors, traffic flat) points away from load and towards a slow or rejecting dependency. Open the X-Ray service map for the checkout request: it shows checkout → order-service → a DynamoDB table, and the order-service → DynamoDB edge is red with high latency and a fault rate. Pivot to CloudWatch and overlay the table’s ReadThrottleEvents (or ThrottledRequests) metric: it went non-zero exactly when checkout’s latency began to climb. Now the question is: why is the table throttling on flat traffic? CloudTrail lookup-events for the preceding hour shows a deploy on a different service, and a Logs Insights query on order-service shows it is doing a Scan where it used to Query because a feature flag flipped. The Scan reads every item in the table instead of one key’s items, blows the table’s provisioned (or hot-partition) capacity, DynamoDB throttles, order-service retries (amplifying load), its connection pool fills, and checkout, waiting on order-service, backs up and finally times out. Classic cascade: the symptom is in checkout, the trigger is a flag on a third service, the bottleneck is one DynamoDB partition.
The cascade as a hop-by-hop chain, so you can see exactly where it amplified:
| Hop | What happens | Golden signal here | The amplifier |
|---|---|---|---|
| Flag flips on svc-C | Query → Scan on order-service | — | Reads the whole partition |
| DynamoDB partition | Capacity blown → throttle | ReadThrottleEvents up | Hot-partition limit |
| order-service | Retries the throttled call | Latency up, then errors | Retries with no backoff multiply load |
| Connection pool | Fills with blocked waiters | Saturation up | No bulkhead → shared pool exhausts |
| checkout | Threads blocked on order-service | p99 up, then 5xx | No timeout / circuit breaker |
Fix.
- Mitigate now: turn off the feature flag that introduced the Scan (removes the load source instantly); if the table is provisioned, bump capacity or enable on-demand to absorb the burst; ensure clients use exponential backoff with jitter so retries stop amplifying.
- Fix the cause: restore the access pattern to a Query against a proper key/GSI; add a circuit breaker in checkout so a slow dependency fails fast instead of consuming all threads; set sensible client timeouts and bounded retries; consider DAX or caching for the hot read.
- Prevent: alarm on ReadThrottleEvents and on the saturation of the connection pool; add a canary that exercises checkout end to end; require load-testing of access-pattern changes; adopt bulkheads so one dependency cannot exhaust a shared pool.
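The "exponential backoff with jitter" and "bounded retries" items can be sketched in a few lines. This is a minimal illustration, not the author's client code: the `Throttled` exception, the function names and the default `base`/`cap` values are all assumptions chosen for the example, and it uses the "full jitter" variant (a random delay between zero and the capped exponential ceiling).

```python
import random
import time


class Throttled(Exception):
    """Hypothetical stand-in for a throttling error from a dependency."""


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 0.1, 0.2, 0.4, ... up to cap
        yield rng() * ceiling                      # full jitter: uniform in [0, ceiling)


def call_with_backoff(fn, max_retries=4, sleep=time.sleep, **backoff_kwargs):
    """Call fn; on Throttled, wait a jittered delay and retry a bounded number of times."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return fn()
        except Throttled:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the throttle instead of looping forever
            sleep(delay)
```

The bound matters as much as the jitter: an unbounded retry loop is exactly the amplifier in the order-service hop of the cascade.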
The resilience primitives that contain a cascade, and what each one stops:
| Primitive | What it does | Stops which failure | Cost / trade-off |
|---|---|---|---|
| Timeout | Caps how long a call waits | Threads pinned on a slow dependency | Too tight → false failures |
| Exponential backoff + jitter | Spreads and slows retries | Retry storms amplifying a throttle | Slightly higher tail latency |
| Circuit breaker | Fails fast when a dependency is sick | Cascade through the caller | Needs tuning of open/half-open thresholds |
| Bulkhead | Isolates dependencies into separate pools | One downstream exhausting a shared pool | More pools to size and monitor |
| Caching / DAX | Cuts read volume to the hot path | Hot-partition throttle | Staleness; cache invalidation |
The lesson: retries and missing timeouts turn a small dependency hiccup into a system-wide outage. Backoff with jitter, timeouts, circuit breakers and bulkheads are the resilience primitives that contain a cascade. The access-pattern depth is in DynamoDB Deep Dive: Tables, Keys, Capacity, GSIs & Streams.
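To make the circuit-breaker row concrete, here is a deliberately small sketch of the closed, open and half-open states. It is an illustration under assumptions, not a production library: the class name, the `failure_threshold` and `reset_after` parameters and the single-trial half-open behaviour are all choices made for this example, and real breakers usually track failure rates over a window rather than a simple counter.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("dependency marked unhealthy; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0      # success closes the breaker and resets the count
        self.opened_at = None
        return result
```

While the breaker is open, checkout would get an immediate `CircuitOpen` instead of pinning a thread on order-service, which is the "fails fast" behaviour the table describes.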
Worked scenario 2 — Availability Zone impairment and failover
Symptom. Error rate jumps to roughly one-third of requests and p99 latency spikes, but two-thirds of requests are perfectly healthy. The pattern is partial and persistent — refreshes sometimes succeed, sometimes fail.
Hypotheses.
- A bad deploy on a subset of instances (canary gone wrong).
- One AZ is impaired — networking, a backing service, or capacity in that zone.
- A single unhealthy backend (one RDS replica, one cache node) serving a fraction of traffic.
Cross-service diagnosis. “One-third failing, two-thirds fine” across a three-AZ deployment is the signature of single-AZ impairment. Confirm it three ways. First, the AWS Health Dashboard (Personal Health Dashboard) — check for an account-specific event naming an AZ ID (note: AZ names like us-east-1a are randomised per account; the Health event and your metrics should be reconciled by AZ ID, e.g. use1-az2). Second, break the load balancer’s metrics down by AZ: target group HealthyHostCount has dropped in one AZ and HTTPCode_Target_5XX_Count is concentrated there. Third, check the data tier — if RDS is Multi-AZ, is the primary in the impaired zone? Is one ElastiCache shard’s primary there? CloudTrail will be quiet — this is not a change you made, which itself is a strong signal that the cause is infrastructure, not config.
The three confirmations, with the exact signal each gives:
| Confirmation | Where | Exact signal | Reads as |
|---|---|---|---|
| Health event | PHD (Personal Health Dashboard) | An open event naming an AZ ID (use1-az2) | AWS has acknowledged the zone |
| Per-AZ LB metrics | CloudWatch (ALB, dimension by AZ) | HealthyHostCount ↓ and HTTPCode_Target_5XX_Count ↑ in one AZ | Your traffic confirms the zone |
| Data-tier primary | RDS / ElastiCache console | Multi-AZ primary or shard primary sits in the impaired AZ | The stateful blast radius |
| CloudTrail | CloudTrail | Quiet at the inflection minute | Not a change you made → infra |
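The "errors concentrated in one AZ" reading can be turned into a quick check once you have per-AZ 5xx counts in hand. The sketch below assumes you have already pulled those counts and keyed them by AZ ID; the dictionary shape, the function name and the 60% share threshold are illustrative assumptions, not AWS-defined values.

```python
def impaired_zones(per_az, min_share=0.6):
    """Return the AZs that hold at least min_share of all 5xx responses.

    per_az maps an AZ label to {"requests": int, "errors_5xx": int}.
    With traffic spread evenly over three AZs, a healthy zone holds roughly
    a third of the errors, so a share well above that singles one zone out.
    """
    total = sum(stats["errors_5xx"] for stats in per_az.values())
    if total == 0:
        return []
    return sorted(
        az for az, stats in per_az.items()
        if stats["errors_5xx"] / total >= min_share
    )


sample = {
    "use1-az1": {"requests": 1000, "errors_5xx": 4},
    "use1-az2": {"requests": 1000, "errors_5xx": 950},
    "use1-az4": {"requests": 1000, "errors_5xx": 3},
}
print(impaired_zones(sample))
```

A result naming exactly one zone, alongside a quiet CloudTrail, is the pattern the table reads as infrastructure rather than a change you made.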
The CLI to break ALB health down by AZ and to trigger an RDS failover:
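# Healthy host count for one target group in one AZ; the AZ with the drop is impaired.
# ALB publishes this metric per TargetGroup + LoadBalancer (+ AvailabilityZone), so pass all three.
# The AvailabilityZone dimension takes the zone name (us-east-1a), not the AZ ID (use1-az2);
# map name to ID with `aws ec2 describe-availability-zones`. The LoadBalancer value is a placeholder.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/checkout/abc \
               Name=LoadBalancer,Value=app/checkout-alb/abc \
               Name=AvailabilityZone,Value=us-east-1a \
  --start-time 2026-06-15T14:00:00Z --end-time 2026-06-15T15:00:00Z \
  --period 60 --statistics Minimum --output table
# If the impaired AZ holds the RDS primary, fail over Multi-AZ (promotes the standby)
aws rds reboot-db-instance --db-instance-identifier orders-prod --force-failover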
Fix.
- Mitigate now: fail traffic away from the bad AZ. If targets are auto-registered, the load balancer’s health checks should already be routing around unhealthy targets — verify cross-zone load balancing and that healthy capacity in the other AZs can absorb the shifted load (this is why you provision for N+1 across AZs). For data, if the impaired AZ holds an RDS primary, trigger a Multi-AZ failover; promote a healthy read replica or fail over the cache primary as needed.
- Fix the cause: you do not fix an AZ — AWS does. Your job is to confirm your failover actually works and that capacity headroom in surviving AZs is real, not theoretical.
- Prevent: deploy across at least three AZs with enough headroom to lose one; enable RDS Multi-AZ and test failover regularly (game days); make sure Auto Scaling is balanced across AZs and that subnets exist in each; alarm on per-AZ healthy-host count, not just the aggregate.
What the architecture must already have for an AZ loss to be a non-event:
| Guardrail | Why | How to verify | What its absence causes |
|---|---|---|---|
| ≥3 AZs with N+1 headroom | Survive losing one zone with room to spare | ASG spread; subnet-per-AZ | Surviving AZs overload when one fails |
| Cross-zone load balancing | Spread shifted load evenly | ALB attribute enabled | One AZ’s targets overload |
| RDS Multi-AZ + tested failover | Promote standby fast | Game-day a forced failover | Stateful tier stuck in the bad AZ |
| Per-AZ alarms (not just aggregate) | The aggregate hides a one-AZ drop | Alarm on HealthyHostCount per AZ | You learn of it from customers |
| Reconcile by AZ ID, not name | AZ names are randomised per account | Map us-east-1a↔use1-az2 | You chase the wrong zone |
The lesson: design assuming an AZ will fail, then an AZ failure is a non-event. The incident is only severe if your architecture cannot tolerate losing one zone. The Multi-AZ depth is in RDS & Aurora Deep Dive: Engines, Multi-AZ, Replicas & Backups, and the load-balancer mechanics in Elastic Load Balancing: ALB, NLB & GWLB Deep Dive.
Worked scenario 3 — A service-quota breach under load
Symptom. During a traffic surge, a fraction of API requests fail with errors that mention Rate exceeded, LimitExceeded, or ThrottlingException — and the failures track traffic: more load, more failures, and they vanish when load drops. Nothing is “down”; the system is hitting an invisible ceiling.
Hypotheses.
- An application-level rate limit or a downstream API throttle.
- A Lambda concurrency or burst limit (account or function-level reserved concurrency).
- An account Service Quota — API request rate, ENIs per Region, concurrent executions, EIPs, etc. — being exceeded as load scales.
Cross-service diagnosis. The “fails in proportion to load, recovers when load drops” shape is the fingerprint of a quota or limit, not a bug. Identify which ceiling. For Lambda, CloudWatch shows Throttles rising with invocations while ConcurrentExecutions flatlines at a round number (1,000 by default, or your reserved figure) — that is the concurrency limit. For API throttling, the error envelope names the service; check the Service Quotas console for that service’s “applied quota value” and compare against your CloudWatch usage metrics (many quotas now publish a usage metric you can alarm on). CloudTrail can show whether someone recently lowered a reserved concurrency or whether a new function ate the account pool. The tell that distinguishes this from scenario 1: here the errors are immediate throttles that scale with traffic, not latency-then-timeout from a slow dependency.
The common quota ceilings that bite under load, their default, and how each shows up:
| Quota | Typical default | Metric / signal | Symptom under load | Mitigation |
|---|---|---|---|---|
| Lambda concurrent executions (account) | 1,000 | Throttles up, ConcurrentExecutions flat at 1,000 | 429 TooManyRequestsException | Raise account quota; reserved per function |
| Lambda reserved concurrency (function) | shared pool | Throttles on one function only | That function throttles, others fine | Raise/remove the too-low reservation |
| API request rate (per service) | service-specific | ThrottlingException, Rate exceeded | A fraction of calls 400/429 | Backoff+jitter; quota increase |
| ENIs per Region | account-specific | ResourceLimitExceeded on scale-out | New tasks/ENIs fail to launch | Request increase; reduce ENI churn |
| Elastic IPs per Region | 5 | AddressLimitExceeded | Cannot allocate an EIP | Request increase; release unused EIPs |
| DynamoDB on-demand throughput | account/table | ProvisionedThroughputExceeded (burst) | Throttles on a sudden spike | Pre-warm; switch capacity mode |
Fix.
- Mitigate now: request a quota increase via the Service Quotas console or API (some are auto-approved, some go to Support — for a Sev-1, open a Support case in parallel); for Lambda, raise reserved concurrency or remove a too-low reservation that is starving the function; shed non-critical load or queue it (SQS) to flatten the spike below the ceiling.
- Fix the cause: right-size reserved concurrency per function so one workload cannot starve others; put a queue in front of spiky producers so the consumer drains at a controlled rate; cache to cut call volume to the throttled API.
- Prevent: this is the headline preventive — monitor quotas proactively. Use Service Quotas usage metrics and Trusted Advisor’s service-limit checks to alarm at ~80% of every quota that scales with your traffic, so the next surge is a warning, not an outage.
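The "alarm at ~80%" rule is simple arithmetic, and a small sketch shows the shape of the check. It assumes you have already collected current usage and applied quota values into two dictionaries; the quota names and numbers here are made up for the example, and in practice a CloudWatch alarm would do this comparison for you.

```python
def quotas_at_risk(usage, limits, threshold=0.8):
    """Return {quota name: utilisation} for every quota at or above the threshold."""
    at_risk = {}
    for name, limit in limits.items():
        used = usage.get(name, 0)
        if limit > 0 and used / limit >= threshold:
            at_risk[name] = round(used / limit, 2)
    return at_risk


# Hypothetical snapshot: applied quota values and current usage.
limits = {"lambda-concurrent-executions": 1000, "elastic-ips": 5, "enis": 5000}
usage = {"lambda-concurrent-executions": 850, "elastic-ips": 2, "enis": 4100}

print(quotas_at_risk(usage, limits))
```

Anything this returns is a ticket to raise before the next surge rather than a page during it.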
The CLI to read an applied quota and to request an increase:
# Read the applied value for Lambda concurrent executions (quota L-B99A9384)
aws service-quotas get-service-quota \
--service-code lambda --quota-code L-B99A9384 \
--query 'Quota.{Name:QuotaName,Applied:Value}' --output table
# Request an increase (some auto-approve; others raise a Support case)
aws service-quotas request-service-quota-increase \
--service-code lambda --quota-code L-B99A9384 --desired-value 3000
The lesson: quotas are a capacity-planning problem, not an incident. If you discover a quota during an outage, you have a monitoring gap, not just a limit. The concurrency mechanics are in Lambda Deep Dive: Runtimes, Triggers, Layers & Concurrency.
Worked scenario 4 — An IAM/SCP change with a wide blast radius
Symptom. At a precise timestamp, multiple unrelated services across one or several accounts start failing with AccessDenied — uploads to S3, a Lambda that can no longer write to DynamoDB, an ECS task that cannot pull from ECR. No application deploy happened. The breadth is the clue: many services, one moment, same error class.
Hypotheses.
- A KMS key policy or grant was changed and everything that decrypts through it now fails (a wide but specific blast radius).
- An IAM change — a shared role’s permissions, or a permission boundary tightened.
- A Service Control Policy (SCP) at the Organization/OU level changed and is now denying actions org-wide regardless of IAM (SCPs set the maximum permissions; an explicit deny there wins everywhere).
Cross-service diagnosis. When AccessDenied appears across many services at the same minute, suspect a policy layer above the application, because identity-based policies are usually edited per service. Go straight to CloudTrail and look up PutBucketPolicy, PutKeyPolicy, PutRolePolicy, DeleteRolePolicy, PutRolePermissionsBoundary, and crucially the Organizations events UpdatePolicy/AttachPolicy (SCPs are recorded in the management/delegated-admin account’s CloudTrail). The timestamp of the change will line up exactly with the onset. To prove which layer is denying, take one failing call and reason through the evaluation chain: an SCP deny blocks the action no matter what IAM allows, so if the IAM policy looks correct but the call still fails org-wide, the SCP (or a permission boundary, or a resource-policy/KMS-key-policy deny) is the culprit. CloudTrail’s errorCode: AccessDenied entries, combined with knowing that explicit deny always wins, let you walk from symptom to the exact policy edit.
The policy layers, ordered by reach, with the CloudTrail event that records an edit to each:
| Layer | Reach | Explicit deny here means… | CloudTrail event to look up | Where the trail lives |
|---|---|---|---|---|
| SCP (Organizations) | Whole OU / org | Denied everywhere, regardless of IAM | AttachPolicy, UpdatePolicy, DetachPolicy | Management / delegated-admin account |
| Resource policy (S3 bucket, etc.) | That resource | Denied on that resource for all principals | PutBucketPolicy, SetRepositoryPolicy | The resource’s account |
| KMS key policy / grant | Everything using the key | Decrypt/encrypt fails for the named principals | PutKeyPolicy, RevokeGrant, CreateGrant | The key’s account |
| Permission boundary | The bounded principal | Caps that role even if its policy allows | PutRolePermissionsBoundary, DeleteRolePermissionsBoundary | The principal’s account |
| Identity policy (IAM) | One user/role | The usual per-service permission | PutRolePolicy, AttachRolePolicy, DeleteRolePolicy | The principal’s account |
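The "explicit deny always wins" reasoning can be written down as a tiny decision function. This is a heavily simplified teaching model, not the real IAM evaluation engine: it ignores session policies, cross-account rules and the cases where a resource policy alone can grant access, and the layer names and return format are assumptions made for the example.

```python
ALL_LAYERS = ("scp", "resource_policy", "permission_boundary", "identity_policy")


def evaluate(layers):
    """Return (decision, reason) for one request.

    layers maps a layer name to "allow", "deny" (explicit deny) or "none"
    (no matching statement). Omit "scp" or "permission_boundary" when that
    layer is not attached; a missing "identity_policy" counts as "none".
    """
    # 1. An explicit deny in any layer ends the evaluation immediately.
    for name in ALL_LAYERS:
        if layers.get(name) == "deny":
            return ("Deny", f"explicit deny in {name}")
    # 2. Ceiling layers that are attached must each allow the action.
    for name in ("scp", "permission_boundary"):
        if name in layers and layers[name] != "allow":
            return ("Deny", f"implicit deny: no allow in {name}")
    # 3. Finally, the identity policy itself must allow it.
    if layers.get("identity_policy", "none") != "allow":
        return ("Deny", "implicit deny: no allow in identity_policy")
    return ("Allow", None)


# The IAM policy looks correct, but an SCP deny still blocks the call.
print(evaluate({"scp": "deny", "identity_policy": "allow"}))
```

Walking one failing call through these three steps is usually enough to name the layer you then look up in CloudTrail.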
The decision table — given the breadth and the trail, which layer is the culprit:
| If you see… | And CloudTrail shows… | It’s probably… | Do this |
|---|---|---|---|
| Many services, many accounts, one minute | AttachPolicy/UpdatePolicy in mgmt account | An SCP edit | Detach/correct the SCP in the management account |
| Everything that decrypts fails | PutKeyPolicy/RevokeGrant | A KMS key policy/grant change | Restore the Decrypt/GenerateDataKey statement |
| One bucket denies all writers | PutBucketPolicy | A resource policy edit | Revert the bucket policy statement |
| One role suddenly limited | PutRolePermissionsBoundary | A tightened boundary | Loosen/correct the boundary narrowly |
| One service, one account | PutRolePolicy/DeleteRolePolicy | An identity policy edit | Revert the role policy |
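The decision table can be applied mechanically to a batch of CloudTrail records. The sketch below assumes the events have already been fetched and parsed into dictionaries with `EventTime` (a timezone-aware datetime), `EventName` and `Username` keys, which mirrors the fields the lookup output exposes; the function name, the ten-minute window and the sample records are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Event name -> the policy layer an edit of that kind touches.
EVENT_LAYER = {
    "AttachPolicy": "SCP",
    "UpdatePolicy": "SCP",
    "DetachPolicy": "SCP",
    "PutBucketPolicy": "Resource policy",
    "PutKeyPolicy": "KMS key policy / grant",
    "RevokeGrant": "KMS key policy / grant",
    "PutRolePermissionsBoundary": "Permission boundary",
    "PutRolePolicy": "Identity policy",
    "DeleteRolePolicy": "Identity policy",
}


def suspects(events, onset, window_minutes=10):
    """Return (time, event name, layer, user) for policy edits shortly before onset."""
    start = onset - timedelta(minutes=window_minutes)
    hits = [
        (e["EventTime"], e["EventName"], EVENT_LAYER[e["EventName"]], e.get("Username"))
        for e in events
        if e["EventName"] in EVENT_LAYER and start <= e["EventTime"] <= onset
    ]
    return sorted(hits, key=lambda hit: hit[0])


onset = datetime(2026, 6, 15, 14, 30, tzinfo=timezone.utc)
events = [
    {"EventTime": datetime(2026, 6, 15, 14, 29, tzinfo=timezone.utc),
     "EventName": "AttachPolicy", "Username": "alice"},
    {"EventTime": datetime(2026, 6, 15, 14, 27, tzinfo=timezone.utc),
     "EventName": "DescribeInstances", "Username": "bob"},
    {"EventTime": datetime(2026, 6, 15, 13, 0, tzinfo=timezone.utc),
     "EventName": "PutKeyPolicy", "Username": "carol"},
]
for hit in suspects(events, onset):
    print(hit)
```

Only the edit inside the window survives the filter, which is the "timestamp lines up with the onset" test done in code.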
Fix.
- Mitigate now: revert the offending policy change. This is the fastest mitigation, and CloudTrail tells you precisely what changed and to what. If it was an SCP, detach or correct it in the management account; if a KMS key policy, restore the statement that granted the failing principals kms:Decrypt/kms:GenerateDataKey.
- Fix the cause: re-apply the intended restriction correctly and narrowly; most blast-radius incidents come from an overly broad deny or a removed-too-much edit. Scope conditions tightly; never broaden a deny without modelling who it hits.
- Prevent: manage IAM/SCPs/key policies as code through a pipeline with review and a plan/diff, never click-ops in the console; test SCP changes against a non-production OU first; use IAM Access Analyzer and policy simulation pre-merge; alarm on CloudTrail for sensitive events (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy) so a risky change is visible immediately.
The CloudTrail lookup that finds the offending edit, by event name:
# Find SCP/role/key-policy edits in the inflection window (run in the mgmt account for SCPs)
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=AttachPolicy \
--start-time 2026-06-15T14:25:00Z --end-time 2026-06-15T14:35:00Z \
--query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table
The lesson: a one-line policy edit can have the widest blast radius in AWS. The layered evaluation model (SCP → resource policy → permission boundary → identity policy, with explicit deny trumping all) is exactly what you walk during diagnosis — and exactly why these changes belong in a reviewed pipeline. The governance depth is in Organizations, SCP Guardrails & Delegated Admin and KMS Encryption Deep Dive: Keys, Policies, Envelope & Rotation.
Worked scenario 5 — A Route 53 / DNS or certificate failure
Symptom. Users report “the site won’t load” or “your connection is not private,” but your load balancer and application metrics look completely healthy — low latency, no 5xx, normal CPU. The failure is happening before traffic reaches your infrastructure, which is why your dashboards are clean.
Hypotheses.
- DNS resolution is failing or returning the wrong answer (a record was changed/deleted, a failover/health-check flipped, an NS delegation issue).
- A TLS certificate expired or the wrong certificate is being served (ACM cert not renewed because validation lapsed; cert/domain mismatch).
- CDN/edge or WAF is blocking or misrouting at the edge.
Cross-service diagnosis. Healthy backend metrics with “can’t load” reports is the signature of an edge/DNS/cert problem. Split the two possibilities at the command line. For DNS, resolve the name and inspect the answer end to end:
dig +trace example.com # follows the delegation chain NS by NS
dig example.com A @8.8.8.8 # what a public resolver actually returns
If dig returns the wrong IP, an empty answer, or SERVFAIL, the problem is DNS. In Route 53, check whether a health check flipped a failover record (a health check that fails falsely will route traffic away from an endpoint that is in fact healthy), whether a record was edited (CloudTrail ChangeResourceRecordSets shows the change and the actor), and whether the hosted zone’s NS records still match the registrar’s delegation. For TLS, inspect the served certificate’s expiry and subject:
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
| openssl x509 -noout -dates -subject -issuer
An expired notAfter, or a subject/SAN that does not cover the host, is your cause. With ACM, expiry almost always traces back to DNS validation breaking (the CNAME validation record was removed, so ACM could not auto-renew) — and CloudTrail/ACM events will show the renewal failures.
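If you want to turn the `notAfter` line from that openssl output into a number you can alarm on, the standard library can parse it. This is a small sketch, assuming you feed it the exact `notAfter=...` line that `openssl x509 -noout -dates` prints; the function name and the injectable `now` parameter are choices made for the example.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after_line, now=None):
    """Parse an openssl 'notAfter=...' line; return whole days left (negative if expired)."""
    value = not_after_line.strip().removeprefix("notAfter=")
    # ssl.cert_time_to_seconds understands the 'Jun 15 12:00:00 2026 GMT' format.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


check_time = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(days_until_expiry("notAfter=Jun 15 12:00:00 2026 GMT", now=check_time))
```

A negative result means the certificate has already expired; a small positive one is the early warning you would rather act on.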
The edge failure modes, the one command that confirms each, and the fix:
| Failure mode | Confirm with | Tell-tale | Mitigate | Fix the cause |
|---|---|---|---|---|
| Record edited/deleted | dig A @8.8.8.8; CloudTrail ChangeResourceRecordSets | Wrong/empty answer | Revert the record change | Manage records as reviewed code |
| Failover health-check flipped | Route 53 health-check status | Traffic routed to the secondary/none | Correct the health-check config | Tune thresholds; right endpoint |
| NS delegation mismatch | dig +trace; compare registrar NS | Delegation chain breaks | Fix registrar NS to match zone | Document delegation; alarm on drift |
| Cert expired | openssl s_client notAfter | ERR_CERT_DATE_INVALID | Deploy a valid cert now | Let ACM auto-renew; keep CNAME |
| Wrong cert / SAN mismatch | openssl s_client -subject | Host not in subject/SAN | Attach the correct cert | Issue/replace cert covering the host |
| ACM auto-renew failed | ACM console; CloudTrail renewal events | Validation CNAME missing | Re-add the validation CNAME | Keep the DNS validation record intact |
Fix.
- Mitigate now: for DNS, revert the bad ChangeResourceRecordSets or correct the failover/health-check configuration so traffic routes to the healthy endpoint; for an expired cert, deploy a valid certificate to the load balancer/CloudFront immediately. Remember DNS TTLs mean changes are not instant; lower the TTL if you anticipate needing fast cutovers.
- Fix the cause: restore the ACM DNS validation CNAME so auto-renewal works (this is the single most common ACM expiry cause); fix the NS delegation if the registrar and hosted zone disagree; correct health-check thresholds that are too sensitive.
- Prevent: let ACM manage and auto-renew certificates with DNS validation kept intact (no manual certs to forget); alarm on DaysToExpiry for any certificate; manage Route 53 records as code with review; add an external synthetic check (a canary that resolves DNS and validates the cert from outside your VPC) so you detect an edge failure your internal dashboards cannot see.
The lesson: when your metrics are healthy but users cannot reach you, look outward — DNS, certificate, edge — because the failure is upstream of everything your CloudWatch sees. The routing and health-check depth is in Route 53: DNS Records, Routing Policies & Health Checks.
Service Quotas, Trusted Advisor and the Health Dashboard as prevention
Three of the five scenarios above (and most real incidents) are preventable with hygiene that lives outside the application:
- Service Quotas is the source of truth for account limits. Most quotas now publish a usage CloudWatch metric — alarm at ~80% of every quota that scales with traffic (concurrent Lambda executions, ENIs/Region, EIPs, API request rates, RDS instances). Request increases ahead of known peaks.
- Trusted Advisor runs periodic checks across cost, performance, security, fault tolerance and service limits. Its limit and fault-tolerance checks are a cheap pre-incident sweep; its security checks catch the open-bucket and over-broad-policy classes before they bite.
- AWS Health Dashboard — the Personal Health Dashboard (PHD) shows events specific to your account and resources (scheduled maintenance, AZ events, deprecations) and is your first stop to answer “is this AWS or me?” The Service Health Dashboard is the public, all-customers view. PHD can fire EventBridge events so AWS-side issues page you automatically.
The three prevention tools, what they catch, their cadence, and how to wire an alarm:
| Tool | Catches | Cadence | How to alarm / automate | Cost |
|---|---|---|---|---|
| Service Quotas (usage metrics) | Approaching a limit that scales with traffic | Near-real-time metric | CloudWatch alarm at 80% of AWS/Usage | Free |
| Trusted Advisor (limit + FT checks) | Service-limit headroom, single-AZ risk, open buckets | Periodic (refresh) | EventBridge on check status; weekly review | Business/Enterprise Support for full set |
| AWS Health (PHD) | AWS-side events for your resources/AZs | Event-driven | EventBridge rule → SNS/page | Free |
| CloudWatch composite alarm | Multi-signal SLO breach (burn rate) | Real-time | Composite of golden-signal alarms | Per-alarm pricing |
The pattern across all three: convert future incidents into present-day alarms. A quota you alarm on at 80% is a Tuesday-afternoon ticket; the same quota discovered at 100% during a launch is a Sev-1.
Closing the loop: the blameless COE
Prevention is only real once it is written down and tracked. Amazon’s Correction of Error (COE) is a blameless postmortem with a fixed shape — and the discipline is that every contributing factor produces a tracked, owned, dated CAPA action item. The sections, what each captures, and the trap if you skip it:
| COE section | What it captures | Done well | Trap if skipped |
|---|---|---|---|
| Summary | One paragraph: what broke, who was impacted | Plain, customer-framed | Reads as internal jargon nobody acts on |
| Impact | Duration, scope, requests/revenue affected | Quantified, not “some users” | Severity gets re-litigated later |
| Timeline | Detection → mitigation → resolution on one UTC clock | Minute-by-minute with actors | “It was a blur” — no learning |
| Five-whys | Trigger → … → systemic root cause | Reaches a missing alarm/timeout/review | Stops at the trigger; it recurs |
| Contributing factors | Everything that made it worse or slower | Honest, blameless | Single-cause myth hides real gaps |
| CAPA action items | Owned, dated corrective + preventive work | One alarm/guardrail per factor | A wish list nobody completes |
| Lessons learned | What the team now knows | Shared org-wide | Knowledge stays in one head |
The five-whys is where juniors stop too early. “DynamoDB throttled” is a trigger, not a root cause; keep asking until you reach the systemic gap — why was there no throttle alarm, no backoff, no load test on access-pattern changes? — because that gap is what the CAPA items must close.
Architecture at a glance
The diagram traces a real request as it actually flows and maps each of the five complex-incident classes onto the exact hop where it bites. Read it left to right. A request enters at the edge: Route 53 resolves the name and CloudFront terminates TLS with an ACM certificate — this is where DNS/cert failures (badge 1) strike, and the cruel part is that everything downstream stays green, so your dashboards lie. It passes into ingress, an ALB spread across three AZs with WAF in front — this is where a single-AZ impairment (badge 2) shows up as one-third of requests failing while the per-AZ HealthyHostCount drops in exactly one zone. It reaches compute: the checkout service calls the order service (ECS/EKS pools and Lambda concurrency), where a quota or concurrency ceiling (badge 3) throttles in proportion to load. The order service then hits the data tier — a DynamoDB table whose hot partition throttles and cascades back upstream (badge 4) when a flag flips Query to Scan, plus an RDS Multi-AZ instance whose primary may sit in the impaired zone.
Above and beneath the path runs the control & correlation plane — CloudWatch (metrics + Logs Insights), CloudTrail (who changed what), and the X-Ray service map (the fault edge) — all read against one UTC clock. That plane is also where an IAM/SCP blast-radius change (badge 5) is diagnosed: a one-line policy edit denies many services at one minute, and only CloudTrail in the management account shows the edit. The whole method is in the picture: localise the symptom to a hop, read the golden-signal shape, run the named tool on one clock, and walk to the cause even when it lives a hop or a policy-layer away.
Real-world scenario
Northwind Commerce runs an e-commerce platform on AWS in ap-south-1 (Mumbai): CloudFront → ALB (three AZs) → an ECS Fargate checkout service → a Lambda order service → a DynamoDB orders table, with RDS Multi-AZ (PostgreSQL) for the ledger and an SQS queue for async fulfilment. Traffic averages 600 requests/second with a 7pm festival-sale spike to ~2,400 rps. The platform team is five engineers across two squads; the monthly AWS spend is about ₹9,80,000.
The incident began on a Saturday festival sale. At 19:06 the SLO burn-rate alarm fired: checkout error rate crossing 5% and climbing. The first responder’s reflex was to look at the checkout service — CPU normal, no deploy, recent logs unremarkable. Meanwhile the on-call for the order squad swore they had not deployed either. Two squads, two dashboards, and for the first eight minutes no Incident Commander — three theories, no progress, exactly the failure the lifecycle exists to prevent.
The turn came when a senior engineer took the IC role, put every console in UTC, and ran the golden-signal sweep on checkout and its dependencies. The shape was unambiguous: checkout p99 rose first (200 ms → 6 s), then errors followed, with traffic only modestly up, which points to a slow or rejecting dependency, not load. The X-Ray service map showed the order-service → DynamoDB edge glowing red with a fault rate. An overlay of the table’s ReadThrottleEvents in CloudWatch showed the metric going non-zero at 19:04, two minutes before the alarm. Now the real question: why throttle on near-flat traffic? A CloudTrail lookup-events for the preceding hour surfaced an UpdateFunctionConfiguration on a recommendations Lambda at 18:58, and a Logs Insights query on order-service showed it had started doing a Scan instead of a Query: a shared library upgrade in that deploy had flipped a feature flag that changed the access pattern. The Scan hammered one partition, DynamoDB throttled, order-service retried without backoff (amplifying the load), its connection pool filled, and checkout, blocked waiting on order-service, backed up and finally 5xx’d. The symptom was in checkout; the trigger was a deploy on a third service; the bottleneck was one DynamoDB partition.
Mitigation, in order: the IC had the recommendations squad revert the flag (removing the Scan instantly), flipped the orders table to on-demand to absorb the residual burst, and confirmed clients now had backoff with jitter via a config push. Error rate fell below 1% within four minutes of the flag revert. They did not scale the checkout service — that would have masked nothing and cost money, the classic reflex the team had been burned by before.
The durable fix landed the following week: restore the Query access pattern against the right GSI; add a circuit breaker in checkout so a slow order-service fails fast instead of consuming the whole pool; bound retries and set a 2-second client timeout; and add bulkheads so the order dependency cannot exhaust checkout’s shared pool. On prevention they wired three alarms — ReadThrottleEvents > 0, connection-pool saturation, and an end-to-end checkout canary — and added a CI gate requiring a load test for any access-pattern change. The next festival sale ran at 2,500 rps with zero DynamoDB throttles, checkout p95 held at 180 ms, and the COE’s headline line went on the wall: “The failing service is rarely the broken one — read the shape, follow the edge, find the flag.”
The incident as a timeline, because the order of moves is the lesson:
| Time (UTC offset) | Symptom | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| 18:58 | (none yet) | recommendations deploy flips a flag | Scan begins on order-service | Load-test access-pattern changes in CI |
| 19:04 | ReadThrottleEvents > 0 | (no alarm: gap) | Throttle starts; latency builds | An alarm here would pre-empt the page |
| 19:06 | Checkout error > 5% | Burn-rate alarm fires; stare at checkout | 8 min lost, no IC | Name an IC immediately |
| 19:14 | Still climbing | Senior takes IC, sweeps golden signals in UTC | Shape = slow dependency | The breakthrough |
| 19:17 | Edge identified | X-Ray red edge + ReadThrottleEvents overlay | Bottleneck = one DDB partition | — |
| 19:20 | Trigger found | CloudTrail shows 18:58 deploy; Logs show Scan | Cause = flipped flag | — |
| 19:24 | Mitigated | Revert flag; table → on-demand; backoff push | Errors < 1% | Correct night-of mitigation |
| +1 week | Fixed | Query+GSI, circuit breaker, bulkheads, 3 alarms, canary | 0 throttles at 2,500 rps | The actual fix is code + alarms |
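The timeline's intervals are worth computing explicitly, because they are the numbers a COE's Impact and Timeline sections will quote. The sketch below uses only the clock times from the table; the metric names (`trigger_to_alarm` and so on) are labels invented for this example, and the arithmetic assumes all timestamps fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from start to end, both given as same-day 'HH:MM' strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


timeline = {
    "trigger": "18:58",    # recommendations deploy flips the flag
    "throttle": "19:04",   # ReadThrottleEvents goes non-zero
    "alarm": "19:06",      # burn-rate alarm fires
    "ic_named": "19:14",   # senior engineer takes the IC role
    "mitigated": "19:24",  # flag reverted, errors below 1%
}

metrics = {
    "trigger_to_alarm": minutes_between(timeline["trigger"], timeline["alarm"]),
    "throttle_unalarmed": minutes_between(timeline["throttle"], timeline["alarm"]),
    "alarm_to_ic": minutes_between(timeline["alarm"], timeline["ic_named"]),
    "alarm_to_mitigation": minutes_between(timeline["alarm"], timeline["mitigated"]),
}
print(metrics)
```

Of the eighteen minutes between alarm and mitigation, eight passed before anyone held the IC role, which is the gap the "name an IC immediately" row is pointing at.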
Advantages and disadvantages
A disciplined, correlation-first incident method has clear strengths and real costs. Weigh it honestly before you mandate it:
| Advantages (why this method pays off) | Disadvantages (why it has a cost) |
|---|---|
| Finds the cause even when it’s a hop or a policy-layer away — the symptom service is rarely the broken one | Requires instrumentation before the incident: structured logs with trace IDs, X-Ray on the critical path, quota alarms |
| The golden-signal shape classifies the failure in two minutes, before any log-reading | Needs practised judgement — reading shapes well is a skill, not a checklist |
| One UTC clock across four tools removes the “wrong time zone” class of false conclusions | Discipline slips under pressure; it takes drills (game days) to make it muscle memory |
| Mitigate-then-diagnose minimises customer impact; the telemetry keeps the evidence | The instinct to “understand first” is strong; juniors resist mitigating blind |
| The IC role prevents three-theory thrash and protects responders | A dedicated IC is staffing overhead small teams feel acutely |
| Blameless COE + CAPA closes the loop so incidents don’t recur | Postmortems take time and only pay off if CAPA items are tracked to done |
| Quota/AZ/cert hygiene converts whole incident classes into Tuesday tickets | Alarms and canaries cost a little to run and must be maintained (alarm fatigue is real) |
The method is right for any team running a real distributed system where minutes of customer impact are expensive and incidents span services and teams. It is overkill for a single static site or a one-service hobby project — there, a stack trace is enough. The disadvantages are all front-loaded investment: instrument, alarm, drill. Pay it before the incident, or pay double during one.
Hands-on lab — correlate a self-inflicted incident (Free Tier)
You will create a tiny, safe incident and practise the correlation workflow end to end, then clean up. Everything here is Free-Tier-eligible if you tear it down promptly.
1. Set up a function and a log group. Create a minimal Lambda (any runtime) named coe-lab that logs a structured line and occasionally “fails”:
aws lambda create-function --function-name coe-lab \
--runtime python3.13 --handler index.handler --timeout 5 \
--role arn:aws:iam::111122223333:role/your-lambda-basic-role \
--zip-file fileb://function.zip
(function.zip contains an index.py whose handler prints {"level":"INFO","reqId":context.aws_request_id,...} and raises an exception when the event has {"fail": true}.)
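For reference, a minimal sketch of what such an index.py could look like. It emits only the two fields named above (any further fields the description elides are left out), and the exception type and message are assumptions chosen for illustration:

```python
import json


def handler(event, context):
    """Log one structured line per invocation and fail on demand."""
    # aws_request_id is set by the Lambda runtime; fall back for local runs.
    req_id = getattr(context, "aws_request_id", "unknown")

    if event.get("fail"):
        # Emit the ERROR line before raising so a log query can match it.
        print(json.dumps({"level": "ERROR", "reqId": req_id}, separators=(",", ":")))
        # Assumption: any unhandled exception works; RuntimeError is arbitrary.
        raise RuntimeError("simulated failure requested by event")

    print(json.dumps({"level": "INFO", "reqId": req_id}, separators=(",", ":")))
    return {"ok": True, "reqId": req_id}
```

The compact separators keep the output in the `"level":"ERROR"` form with no space after the colon, so a filter written against that exact string still matches.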
2. Generate a signal. Invoke it a few times, including failures, to put both success and error lines into CloudWatch Logs:
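# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload (omit it on CLI v1)
for i in $(seq 1 20); do
aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out \
--payload '{"fail":false}' /dev/null >/dev/null
done
aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out \
--payload '{"fail":true}' /dev/null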
3. Triage with metrics. In the CloudWatch console, open the Errors, Invocations and Throttles metrics for coe-lab (set the timezone to UTC). Note the minute the error appears — that is your inflection point.
4. Drill in with Logs Insights. Run a query against the function’s log group:
fields @timestamp, @message
| filter @message like /ERROR|Traceback|"level":"ERROR"/
| stats count(*) as errors by bin(1m)
| sort @timestamp desc
Confirm the error count and timing match the metric. This is the metric→logs pivot you will do in every real incident.
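If it helps to see what the query is doing, the same per-minute bucketing can be sketched offline. This is an illustration only, not how Logs Insights is implemented; it assumes you already have (timestamp, message) pairs, for example from an exported log file, and it reuses part of the query's pattern:

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Same idea as the filter in the query: any line that looks like an error.
ERROR_RE = re.compile(r"ERROR|Traceback")


def errors_per_minute(records):
    """Count error lines per UTC minute, newest minute first.

    records: iterable of (datetime, message) pairs (an assumed input shape).
    """
    counts = Counter()
    for ts, message in records:
        if ERROR_RE.search(message):
            # Truncate to the minute, the equivalent of bin(1m).
            counts[ts.replace(second=0, microsecond=0)] += 1
    return dict(sorted(counts.items(), reverse=True))
```

The minute with the first non-zero count should be the same minute the Errors metric moved; if the two disagree, check that both views are in UTC before trusting either.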
5. Tie it to a change with CloudTrail. Update the function’s configuration (a harmless change) to create a control-plane event:
aws lambda update-function-configuration --function-name coe-lab --timeout 6
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=coe-lab \
--query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table
You should see your UpdateFunctionConfiguration event with its timestamp and actor — exactly how you would prove “someone changed this at 14:31”.
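Matching a change to an inflection minute is a window filter over those events. A minimal sketch, assuming each event is a dictionary with a timezone-aware `EventTime` (the same field the command above projects); the 30-minute look-back and 5-minute look-ahead are arbitrary defaults, not AWS guidance:

```python
from datetime import datetime, timedelta, timezone


def changes_near(events, inflection,
                 before=timedelta(minutes=30), after=timedelta(minutes=5)):
    """Return events inside the window around the inflection, nearest first.

    events: dictionaries with at least an 'EventTime' datetime (UTC).
    inflection: the UTC minute at which the metric changed shape.
    """
    window_start = inflection - before
    window_end = inflection + after
    hits = [e for e in events if window_start <= e["EventTime"] <= window_end]
    # The closest change in time is the first candidate trigger to examine.
    return sorted(hits, key=lambda e: abs(e["EventTime"] - inflection))
```

The look-back is deliberately wider than the look-ahead: a trigger normally precedes its symptom, and the small look-ahead only absorbs clock and delivery skew. A match is a candidate, not proof, so confirm it against the logs before acting on it.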
6. Read a quota’s applied value. Practise the scenario-3 move — see how much headroom you have on Lambda concurrency:
aws service-quotas get-service-quota \
--service-code lambda --quota-code L-B99A9384 \
--query 'Quota.{Name:QuotaName,Applied:Value}' --output table
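The applied value only becomes useful once you compare it with what is in use. A minimal sketch of the headroom check, assuming you supply the in-use figure yourself (for Lambda, the peak of the ConcurrentExecutions metric); the numbers in the usage note are made up:

```python
def quota_headroom(applied, in_use, warn_ratio=0.8):
    """Summarise how close usage is to a quota.

    applied: the quota's applied value, as read from Service Quotas.
    in_use: current or peak usage, taken from the relevant metric.
    warn_ratio: fraction of the quota at which to warn (0.8 = 80%).
    """
    if applied <= 0:
        raise ValueError("applied quota must be positive")
    utilisation = in_use / applied
    return {
        "utilisation": utilisation,
        "remaining": applied - in_use,
        "warn": utilisation >= warn_ratio,
    }
```

For example, `quota_headroom(1000, 850)` reports 150 remaining and sets `warn` to true, while `quota_headroom(1000, 500)` does not warn. In production the same comparison belongs in an alarm rather than a script.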
Validation. You have, on one UTC timeline, (a) a metric inflection, (b) the matching log lines, (c) the control-plane change that explains a config-driven incident, and (d) a read of the quota that scenario 3 would breach. That is the core RCA loop in miniature.
Each lab step mapped to the real incident move it rehearses:
| Lab step | Real-incident move | Tool |
|---|---|---|
| 3 — note the inflection minute (UTC) | Golden-signal sweep | CloudWatch metrics |
| 4 — Logs Insights error count by minute | Metric → logs pivot | Logs Insights |
| 5 — CloudTrail lookup-events | Tie a step change to a change | CloudTrail |
| 6 — read the applied quota | Quota headroom check (scenario 3) | Service Quotas |
Cleanup.
aws lambda delete-function --function-name coe-lab
aws logs delete-log-group --log-group-name /aws/lambda/coe-lab
Cost note. A handful of Lambda invocations and a few Logs Insights queries fall within the Free Tier; the log group stores a trivial amount. CloudTrail management events are free to view via lookup-events. Deleting the function and log group leaves nothing billable. If you ever enable CloudTrail data events or create a CloudWatch alarm at scale, those can incur small charges — not needed for this lab.
Common mistakes & troubleshooting
This is the differentiator. Cross-service RCA fails for process reasons as often as technical ones — and the technical traps recur. The playbook below is the table to keep open at 02:14: match the symptom of your investigation or the incident, read the cause, run the exact confirm, apply the fix.
| # | Symptom (of your process or the incident) | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Investigation goes in circles; the “fix” doesn’t hold | Comparing signals in different time zones | Check each console/query header — is it local or UTC? | Put every console and query in UTC; one clock, one window |
| 2 | You debug 40 min while customers suffer | Trying to find root cause before mitigating | Is customer impact still live and unmitigated? | Mitigate first (roll back / fail over / raise quota), diagnose after |
| 3 | Three people, three theories, no progress | No Incident Commander | Is anyone coordinating vs everyone debugging? | Name an IC who decides and coordinates and does not debug |
| 4 | You “fixed” the trigger but it recurs | Treated the trigger as the cause | Did the five-whys reach a systemic gap (no alarm/timeout)? | Five-whys to the systemic cause; add the missing guardrail |
| 5 | The one trace you need isn’t in X-Ray | Default sampling dropped it | X-Ray sampling rules show the default reservoir | Raise sampling for the affected service; lean on metrics+logs meanwhile |
| 6 | CloudTrail “shows nothing” around the event | Wrong account, or only management events | Are you in the mgmt/delegated-admin account? Window wide enough? | Check the org trail; widen for delivery latency; enable data events if needed |
| 7 | You blame the failing service | Symptom and cause are in different services | Does the X-Ray map show the fault on a downstream edge? | Follow the service map edge; read golden-signal shapes |
| 8 | You scaled up and it “worked,” then returned worse | Masked a resource ceiling (quota/throttle/saturation) | Do errors track traffic (vanish at rest)? | Find the ceiling (Service Quotas / throttle metric); fix, don’t mask |
| 9 | 502s blamed on the app, but app logs are clean | The load balancer emitted the 502 on timeout | X-Ray shows the request succeeding slowly; client got 502 | Speed up the target; align LB/integration timeout |
| 10 | “One-third fail, two-thirds fine” chased as a bad host | Single-AZ impairment | ALB HealthyHostCount by AZ-ID; PHD event | Fail away from the AZ; rely on cross-AZ N+1 headroom |
| 11 | Many services AccessDenied, you edit each IAM policy | The deny is a layer above IAM (SCP/KMS) | CloudTrail AttachPolicy/PutKeyPolicy in mgmt account | Revert the SCP/key-policy edit — explicit deny wins |
| 12 | Dashboards green but users can’t load; you check the app | Failure is upstream (DNS/cert/edge) | dig +trace; openssl s_client … -dates | Restore the record / ACM validation CNAME / valid cert |
| 13 | Postmortem names a person | Blameful culture | Does the COE ask “who” or “what about the system”? | Run a blameless COE; fix the system that let it happen |
| 14 | Quota discovered at 100% during a launch | No quota monitoring | Is there an alarm at 80% of this quota? | Alarm on the Service Quotas usage metric at ~80% |
Best practices
- Impose the lifecycle every time, even for small incidents — the muscle memory pays off when it is a Sev-1 at 03:00.
- One Incident Commander; comms on a cadence. Separate coordinating, communicating and debugging into different people.
- Mitigate then diagnose. Optimise for time-to-recovery; the telemetry keeps the evidence for the RCA.
- Everything on one UTC timeline. Metrics, logs, traces and CloudTrail aligned to the same clock and window.
- Read the golden-signal shape first. Latency/traffic/errors/saturation movement classifies the failure before any log line.
- Instrument for correlation before you need it: structured logs with request and trace IDs, X-Ray on the critical path, alarms on the four golden signals and on quotas at 80%.
- Build for failure: multi-AZ with headroom, timeouts + backoff-with-jitter + circuit breakers + bulkheads, tested Multi-AZ failover, queues in front of spiky producers.
- Govern change: IAM/SCP/DNS/cert as reviewed code; alarm on sensitive CloudTrail events.
- Never scale to mask a ceiling. If errors track traffic, find the quota/throttle — scaling up just delays and hides it.
- Close the loop with a blameless COE and track CAPA items to done — an action item without an owner and a date is a wish.
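Two of the "build for failure" primitives above, backoff with jitter and a circuit breaker, are small enough to sketch. This is an illustration under stated assumptions, not a production library: it uses the "full jitter" variant of backoff, and every default (base, cap, threshold, reset time) is arbitrary:

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delays: rng() scaled by min(cap, base * 2**n) per attempt."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]


class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a pause."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold      # consecutive failures before opening
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: do not touch the struggling dependency at all.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the pause has elapsed, let this one call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Jitter spreads retries out so that clients do not hammer a recovering dependency in lockstep, and the breaker stops callers from queueing behind a dependency that is already failing. Together they interrupt the retry amplification behind a cascading failure.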
Security notes
- Treat a sudden, broad AccessDenied as potentially a security event, not just an outage — it can equally be an attacker’s tightened policy, a compromised credential changing permissions, or a legitimate-but-wrong edit. CloudTrail (including the Organizations trail) is your forensic record; protect it with log-file validation and a separate, locked-down logging account so an attacker cannot cover their tracks.
- During an incident, mitigations sometimes tempt you to loosen security (open a security group “just to test”, attach an over-broad policy). Resist, or scope it tightly and time-box it — incident-time exceptions are how durable holes get created. Record any exception in the COE and revert it before you close the incident.
- Least-privilege the responders too: break-glass roles should be auditable, MFA-protected and alerted-on, not shared admin keys.
- For DNS/cert incidents, remember that a hijacked Route 53 record or a mis-issued certificate is a security incident — verify who changed the record (CloudTrail ChangeResourceRecordSets) and protect hosted zones and ACM with tight IAM and change review.
- Keep CloudTrail data events in mind: they are off by default and cost extra, but for sensitive buckets/tables they are the difference between knowing and guessing who read what during a suspected breach.
The security-relevant control-plane events to alarm on, and why each matters during an incident:
| Event | Service | Why alarm on it | During an incident it tells you |
|---|---|---|---|
| AttachPolicy / UpdatePolicy | Organizations (SCP) | Widest blast radius in AWS | An org-wide deny was just introduced |
| PutKeyPolicy / RevokeGrant | KMS | Breaks everything that decrypts | Why many services lost Decrypt |
| PutRolePolicy / DeleteRolePolicy | IAM | Per-service permission change | A specific role gained/lost access |
| ChangeResourceRecordSets | Route 53 | DNS hijack / outage vector | Who edited the record and to what |
| AuthorizeSecurityGroupIngress | EC2 | An incident-time “just to test” hole | A security group was opened up |
| StopLogging / DeleteTrail | CloudTrail | An attacker covering tracks | Your evidence source was tampered with |
| PutBucketPolicy / PutBucketAcl | S3 | Can expose a bucket publicly | A data-exposure change just landed |
| ConsoleLogin (failure / new region) | IAM (CloudTrail) | Credential misuse signal | Who logged in, from where, success or not |
Cost & sizing
The incident method itself is nearly free — the cost is in the instrumentation that makes correlation possible, and the trade-off is “spend a little continuously to avoid spending a lot during an outage.” What drives the observability bill, and how to right-size it:
| Cost driver | What it is | Rough figure | Right-size by |
|---|---|---|---|
| CloudWatch custom metrics | Per-metric monthly charge | ~$0.30/metric/mo (first tier) | Emit only the golden signals + key dependencies |
| CloudWatch Logs ingestion | Per-GB ingested | ~$0.50–0.57/GB (Region-dependent) | Sample/structure logs; drop debug in prod |
| Logs Insights queries | Per-GB scanned | ~$0.005/GB scanned | Narrow the time window; filter early |
| X-Ray traces | Per-trace recorded/retrieved | ~$5 per 1M traces recorded | Sample the critical path; raise only during incidents |
| CloudTrail management events | First copy of mgmt events | Free | Always on; it’s your evidence |
| CloudTrail data events | S3/DynamoDB object-level | ~$0.10 per 100K events | Enable only on sensitive buckets/tables |
| CloudWatch alarms | Per-alarm monthly | ~$0.10/alarm/mo (standard) | Alarm on signals + quotas, not everything |
| Synthetics canaries | Per canary run | ~$0.0012/run | One end-to-end canary per critical journey |
A sizing rule of thumb: the four-golden-signal alarms plus a quota alarm at 80% plus one end-to-end canary per critical user journey is a few hundred rupees a month for most stacks — and it is the difference between a Sev-1 and a Tuesday ticket. The expensive mistake is the opposite: no instrumentation, then a multi-hour outage whose revenue cost dwarfs a year of observability spend. Free Tier covers the lab here entirely; production observability scales with traffic, but the golden-signal subset keeps it bounded.
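To make that rule of thumb concrete, here is the arithmetic as a small sketch. It uses only the rough per-unit figures from the table above, in USD; the canary interval and the 30-day month are assumptions, and real prices vary by Region and tier:

```python
def monthly_baseline_usd(alarms=5, canaries=1, canary_interval_min=5,
                         alarm_usd=0.10, canary_run_usd=0.0012, days=30):
    """Rough monthly cost of the minimal alarm-plus-canary baseline.

    alarms: four golden-signal alarms plus one quota alarm by default.
    canary_interval_min: minutes between canary runs (an assumption).
    Prices are the approximate figures from the cost table, in USD.
    """
    runs = canaries * days * 24 * 60 // canary_interval_min
    return round(alarms * alarm_usd + runs * canary_run_usd, 2)
```

With the defaults this comes to about 10.87 USD a month, and stretching the canary to every 15 minutes brings it to about 3.96 USD. The canary's run frequency, not the alarms, dominates the baseline.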
The cost-of-an-incident framing, which is what justifies the spend:
| If you skip… | The incident it enables | Rough cost of that incident |
|---|---|---|
| A quota alarm at 80% | A launch-time Sev-1 throttle | Revenue per minute × outage minutes |
| Per-AZ healthy-host alarms | A one-AZ impairment chased blind | Extended MTTR; possible full outage if N+1 absent |
| A DaysToExpiry cert alarm | A site-wide cert expiry | Total outage until a cert is reissued |
| An external DNS/cert canary | An edge failure your dashboards can’t see | Customer-reported outage; reputational hit |
| X-Ray on the critical path | A cascade you can’t localise | Hours of MTTR finding the fault edge |
| A blameless COE + tracked CAPA | The same incident next quarter | Repeat outage at full cost — the worst spend of all |
Interview & exam questions
1. Walk me through how you run a multi-service incident. Detect → triage (scope, severity, name an IC and comms lead) → communicate on a cadence → mitigate before fully diagnosing → RCA by correlating metrics/logs/traces/CloudTrail on one UTC timeline → prevent via a blameless COE with tracked CAPA items.
2. The symptom is in service A but A didn’t change — how do you find the cause? Use the X-Ray service map to find the failing edge, read the four golden signals for A and its dependencies to classify the failure shape, then pivot to the dependency’s metrics/logs and to CloudTrail for any change at the inflection minute. The cause is usually a downstream throttle, a saturated shared resource, or a config change elsewhere.
3. Errors rise in proportion to traffic and vanish when load drops — what is it, usually? A quota or limit (Lambda concurrency, API request rate, ENIs, etc.), not a bug. Identify it via the throttle metrics and the Service Quotas console; mitigate with a quota increase / load shedding / queueing; prevent by alarming at 80% of quota.
4. One-third of requests fail, two-thirds are fine, persistently — what’s your first hypothesis? Single-AZ impairment on a three-AZ deployment. Confirm with the Personal Health Dashboard, per-AZ load-balancer/target metrics (by AZ ID, since AZ names are randomised per account), and the data tier (is an RDS/cache primary in that AZ). Fail traffic away and rely on cross-AZ headroom.
5. Difference between a mitigation and a fix — give an example. A mitigation stops customer pain now (roll back the deploy, fail over the AZ, raise the quota); a fix removes the cause (correct the access pattern, fix the policy, restore ACM validation). You mitigate first, then fix; both belong in the COE.
6. Multiple services across accounts throw AccessDenied at the same minute — what changed? Almost certainly a policy layer above the app: an SCP (in the management/delegated-admin account), a KMS key policy, or a shared role/permission boundary. CloudTrail (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy, PutRolePolicy) shows the edit; explicit deny wins, so revert the change to mitigate.
7. Your dashboards are green but users say the site won’t load — where do you look? Outward — DNS and certificates and edge. dig +trace to validate resolution and openssl s_client to check the served cert’s expiry/subject. Common cause: ACM auto-renewal failed because the DNS validation CNAME was removed; or a Route 53 record/health-check flipped.
8. What’s a cascading failure and how do you prevent it? A small dependency hiccup amplified by retries and missing timeouts until threads/connections exhaust and the failure spreads upstream. Prevent with timeouts, exponential backoff with jitter, circuit breakers and bulkheads, plus alarms on dependency throttles and saturation.
9. How do you correlate a metric spike to a specific change? Note the exact UTC minute of the inflection, then aws cloudtrail lookup-events for that window. A control-plane event in the same minute (a deploy, a quota edit, a policy change) is your trigger.
10. Why blameless postmortems? Because blame drives information underground — people stop sharing what really happened, and you fix symptoms instead of the systemic gaps (missing alarm, no timeout, click-ops policy edit) that let a human error become an outage. The COE asks “what about the system allowed this?”, not “who did it?”
11. X-Ray is sampled — what if the failing trace wasn’t captured? Temporarily raise the sampling rate for the affected service, and in the meantime lean on metrics (the service-map edge statistics are aggregated, not sampled-away) and structured logs keyed by request ID.
12. How do you make quota breaches a non-event? Treat quotas as capacity planning: enable Service Quotas usage metrics and Trusted Advisor limit checks, alarm at ~80% of every quota that scales with traffic, and request increases ahead of known peaks.
The certifications these questions map to:
| Question theme | Maps to | Domain |
|---|---|---|
| Lifecycle, IC, mitigate-first, COE | SOA-C02 | Monitoring, Logging & Remediation |
| Golden signals, correlation, X-Ray map | SOA-C02 / DOP-C02 | Monitoring & Logging |
| Quotas, AZ design, blast radius | SAP-C02 | Continuous improvement / org complexity |
| Blameless postmortem + CAPA | DOP-C02 | Incident & Event Response |
Quick check
- What must you align across CloudWatch, CloudTrail, X-Ray and logs before you can correlate them?
- In the incident lifecycle, what do you do before you fully understand the cause?
- “Errors rise with traffic, recover when traffic drops” — what class of problem is this?
- Which AWS construct, changed in one account, can deny actions across an entire Organization regardless of IAM?
- Your application metrics are healthy but users can’t reach the site — name two things to check.
Answers
- A single UTC timeline (and ideally a shared request/trace ID) — same clock and window across every tool.
- Mitigate — roll back, fail over, raise a quota, shed load — to stop customer impact; diagnose afterwards from the telemetry.
- A service-quota / limit breach (e.g. Lambda concurrency, API throttling), not an application bug.
- A Service Control Policy (SCP) — an explicit deny in an SCP overrides any IAM allow across the OU/Organization.
- DNS (dig +trace — wrong/empty answer, failover/health-check flip, deleted record) and the TLS certificate (openssl s_client — expired or wrong subject, usually ACM DNS-validation lapse). Edge/WAF is a valid third.
Glossary
- Incident Commander (IC): The single person who coordinates an incident — decides, assigns and protects responders; does not debug.
- Mitigation vs fix: A mitigation restores service now (rollback, failover, quota bump); a fix removes the underlying cause.
- RCA (Root-Cause Analysis): The disciplined search for the true, systemic cause behind an incident, not merely its trigger.
- COE (Correction of Error): Amazon’s blameless postmortem format — timeline, impact, five-whys, contributing factors, and CAPA action items.
- CAPA (Corrective and Preventive Action): The tracked, owned, dated action items that prevent recurrence.
- Golden signals: Latency, traffic, errors, saturation — the four metrics whose shape classifies a failure fast.
- Correlation: Lining up signals from several tools on one UTC clock and window to locate a non-local cause.
- Cascading failure: A small dependency problem amplified (by retries/missing timeouts) until it exhausts resources and spreads upstream.
- Bulkhead / circuit breaker: Resilience patterns that isolate dependencies and fail fast so one slow downstream cannot take down the whole service.
- Blast radius: The breadth of impact of a single change — widest, in AWS, for SCP/IAM/KMS-key-policy edits.
- AZ impairment: A fault confined to one Availability Zone; survivable if you deploy across AZs with headroom.
- Service Quota: An account/Region limit on a resource or API rate; a capacity-planning input, not an incident if monitored.
- Personal Health Dashboard (PHD): The account-specific view of AWS health events affecting your resources.
- Logs Insights: CloudWatch’s query language over log groups — your metric→detail drill-down tool.
- X-Ray service map: The derived call graph across services, with per-edge latency/error/fault statistics — your “where is it failing” view.
Next steps
You now have the operational mindset for incidents that span services. Turn that operational knowledge into design with The AWS Architecting Ladder: From a Static Site to Multi-Region Active-Active, which shows how the resilience primitives you just used in mitigation — Multi-AZ, failover, headroom, bulkheads — are baked into architectures from the ground up, so the incidents in this lesson become non-events. Deepen the instrument plane with CloudWatch & CloudTrail Observability Deep Dive and X-Ray: Service Map, Segments & ADOT Tracing, the two tools you triangulate with most. Revisit the per-layer detail in AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda, and when you are ready to certify, the AWS Certification Prep Kit (CLF, SAA, SOA, DVA, SAP, DOP) maps this lesson to the exact SOA-C02 and SAP-C02 domains it covers.