Advanced AWS Troubleshooting: Complex Multi-Service Incidents & Root-Cause Analysis

A single-service problem is a puzzle; a multi-service incident is a crime scene. When an EC2 instance will not accept SSH you can reason about it in isolation — security group, key, subnet, status checks — and the previous lesson gave you those playbooks. But production rarely fails that politely. Real incidents arrive as a payments API at 30% error rate, and the cause turns out to be a DynamoDB table that started throttling because a deploy three hours ago doubled its read pattern, which exhausted the connection pool in a downstream Lambda, which tripped a circuit breaker in the upstream service, which is why the checkout page — three hops away — is timing out. Nobody changed checkout. The symptom and the cause are in different services, owned by different teams, and the only thing that ties them together is a trace ID and a timeline.

This lesson is about that kind of incident. The skill is no longer “do I know this service” — it is correlation under pressure: imposing a lifecycle on the chaos, reading three or four observability tools against one clock, forming hypotheses you can cheaply disprove, and writing the whole thing down afterwards so it never recurs. We will walk the incident-response lifecycle, the correlation toolkit (CloudWatch metrics and Logs Insights, CloudTrail, X-Ray, the Health Dashboard, Trusted Advisor), five fully worked complex scenarios as symptom → hypotheses → cross-service diagnosis → fix, the role of Service Quotas, and the blameless postmortem (Amazon’s COE) that closes the loop. This is the SOA-C02 and SAP-C02 operational-excellence mindset, written the way an on-call architect actually works it.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites

You should be comfortable with the single-service troubleshooting method and playbooks from AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda — reproduce, isolate the layer, check config against desired state, inspect CloudWatch and CloudTrail, hypothesise, fix, verify, prevent. You should know IAM policy evaluation (explicit deny beats allow beats implicit deny), how VPC routing and security groups work, and roughly what CloudWatch, CloudTrail and X-Ray each record. This lesson sits in the Troubleshooting & operations module of the Zero-to-Hero track, immediately after the single-service playbooks and before the architecting ladder. It is mapped to SOA-C02 (AWS Certified SysOps Administrator) and the operational sections of SAP-C02 (Solutions Architect Professional).

The incident-response lifecycle

Under pressure, ad-hoc investigation produces tunnel vision: three engineers staring at the same dashboard while nobody talks to the customer or writes anything down. A lifecycle is the antidote. It is not bureaucracy — it is the set of jobs that must happen concurrently, and naming them lets you assign them.

| Phase | Goal | Key actions | Done when |
| --- | --- | --- | --- |
| Detect | Know there is an incident, fast | Alarms (CloudWatch composite + SLO burn-rate), synthetic canaries, customer reports, Health Dashboard events | An incident is declared with a severity and a single owner |
| Triage | Size the blast radius and assign roles | Confirm scope (one customer / one AZ / one Region / global), set severity, name an Incident Commander (IC), a comms lead and an ops lead | Everyone knows their role and the severity is agreed |
| Communicate | Keep stakeholders ahead of the rumour | Status page / internal channel updates on a cadence (e.g. every 30 min for Sev-1), business-impact statement | Stakeholders are updated and the cadence is running |
| Mitigate | Stop customer pain before you understand it fully | Roll back, fail over, raise a quota, shed load, disable a feature flag: anything that restores service | Customer impact is gone or sharply reduced |
| RCA | Find the true cause, not the trigger | Correlate logs/metrics/traces/CloudTrail on one timeline, five-whys | The causal chain is written down and agreed |
| Prevent | Make recurrence impossible or detectable | CAPA action items with owners and dates, new alarms, guardrails, runbook updates | Action items are tracked to completion |

Two principles separate seniors from juniors here. First: mitigate before you fully diagnose. The instinct to find the cause before acting is wrong during customer impact — if a rollback or failover restores service, do it, then investigate from a calm seat. The cause is still in the telemetry. Second: one Incident Commander. The IC does not debug; they coordinate, decide, and protect the responders from interruptions. The fastest incidents I have run had an IC who never touched a keyboard.

A note on severity: pick a simple scale and stick to it. A common one is Sev-1 (major customer-facing outage, all hands, exec comms), Sev-2 (significant degradation or single-AZ/single-feature impact), Sev-3 (minor or internal-only). Severity drives the comms cadence and who gets paged — it is a routing decision, not a judgement of blame.

The correlation toolkit: reading four tools against one clock

The defining move of multi-service RCA is correlation. The cause lives in a different service from the symptom, so you must line up signals from several tools on a single, UTC timeline. Configure every console and query to UTC during an incident — the single most common diagnostic error is comparing a log in local time against a metric in UTC and “proving” the wrong thing.
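A minimal sketch of the "one clock" rule, assuming your timestamps arrive as ISO-8601 strings with an explicit offset. The helper name `to_utc` and the sample values are hypothetical; the point is only that two timestamps which look 5.5 hours apart can be the same instant.

```python
from datetime import datetime, timezone


def to_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp that carries an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous; refuse rather than guess its zone.
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return parsed.astimezone(timezone.utc)


# Hypothetical values: a log line stamped in local time (UTC+05:30)
# and a metric datapoint stamped in UTC.
log_time = to_utc("2026-06-15T20:01:00+05:30")
metric_time = to_utc("2026-06-15T14:31:00+00:00")

print(log_time.isoformat())
print(log_time == metric_time)
```

Rejecting naive timestamps is a deliberate choice here: silently assuming a zone is exactly the mistake the paragraph above warns about.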

Here is what each tool is for, and crucially what it is not for:

| Tool | Answers | Best for | Blind spot |
| --- | --- | --- | --- |
| CloudWatch metrics | “What changed, and when?” | Spotting the inflection point: latency, errors, throttles, saturation | Aggregates hide per-request detail; high cardinality is expensive |
| CloudWatch Logs Insights | “What exactly happened in this service?” | Querying structured logs across log groups for patterns, error messages, request IDs | Only as good as your logging; cross-service joins are manual |
| CloudTrail | “Who did what to the control plane, and when?” | Tying an incident to a config change, deploy, IAM/SCP edit, quota change | Management events only by default; data events cost extra; delivery can lag by minutes |
| X-Ray (traces) | “Where in the call graph is the time/error going?” | Following one request across services; the service map shows the failing edge | Sampled by default, so the one trace you want may not be captured |
| AWS Health Dashboard | “Is this AWS’s problem?” | Ruling in/out an AWS-side event in your account/Region (PHD shows account-specific events) | Not real-time to the second; absence of an event is not proof |
| Trusted Advisor | “Am I near a limit or misconfigured?” | Pre-incident hygiene and quota headroom checks | Periodic, not a live diagnostic |

The high-leverage skill is the golden-signal sweep: in the first two minutes, look at latency, traffic, errors and saturation (the four golden signals) for the affected service and its immediate dependencies, all on the same time window. The shape tells you the class of problem before you read a single log line:
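As a rough illustration, three of the failure shapes that the worked scenarios in this lesson describe can be encoded as a small lookup. This is a sketch, not a diagnostic tool: the `Sweep` fields and the `classify` function are hypothetical names, and real incidents rarely reduce to four booleans.

```python
from dataclasses import dataclass


@dataclass
class Sweep:
    """What the first two minutes of a golden-signal sweep showed (hypothetical model)."""
    latency_up: bool                      # p99 latency climbed
    errors_up: bool                       # 5xx or throttle counts climbed
    traffic_up: bool                      # request rate rose over the same window
    errors_track_traffic: bool = False    # errors rise and fall with load
    users_failing_backend_healthy: bool = False  # reports of failure, clean dashboards


def classify(s: Sweep) -> str:
    """Map a sweep to the class of problem suggested by its shape."""
    if s.users_failing_backend_healthy:
        return "edge: DNS / certificate / CDN"
    if s.errors_up and s.errors_track_traffic:
        return "quota or limit"
    if s.latency_up and not s.traffic_up:
        return "slow or rejecting dependency"
    return "unclassified: read the logs"


# Latency first, then errors, on flat traffic (the shape in scenario 1).
print(classify(Sweep(latency_up=True, errors_up=True, traffic_up=False)))
```

The order of the checks matters: a clean backend with failing users is tested first because none of the other signals are meaningful in that case.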

Logs Insights and CloudTrail: the two queries you will type most

When errors spike, this Logs Insights query finds the dominant failure mode fast — bin by minute, count by status, and you see both the onset and the breakdown:

fields @timestamp, @message
| filter status >= 500 or @message like /Throttl|Timeout|ProvisionedThroughputExceeded/
| stats count(*) as errors by bin(1m), errorType
| sort errors desc

And when a metric shows a step change at a specific minute, this CloudTrail lookup answers “what changed”:

# Who/what touched the control plane around the inflection point (UTC)
aws cloudtrail lookup-events \
  --start-time 2026-06-15T14:25:00Z \
  --end-time   2026-06-15T14:35:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName,Src:EventSource}' \
  --output table

Pair a step change in a metric with a CloudTrail event in the same minute and you have usually found your trigger. For request-level correlation, propagate a request ID (and the X-Ray trace ID) through your logs so you can pivot from a metric, to the traces on that edge, to the exact log lines for that request.
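The pairing of a step change with a control-plane event can be sketched offline. The example below assumes you have already exported a metric series and a list of CloudTrail events shaped like the `lookup-events` output (`EventTime`, `EventName`, `Username`); the data, the threshold `factor`, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_inflection(series, factor=3.0, floor=1.0):
    """Return the timestamp of the first datapoint that is at least `factor`
    times the previous one. `floor` keeps a jump from zero detectable."""
    for (_, prev), (ts, cur) in zip(series, series[1:]):
        if cur >= factor * max(prev, floor):
            return ts
    return None


def events_near(events, when, window=timedelta(minutes=5)):
    """Keep only events whose EventTime falls within `window` of `when`."""
    return [e for e in events if abs(e["EventTime"] - when) <= window]


def minute(hh, mm):
    return datetime(2026, 6, 15, hh, mm, tzinfo=timezone.utc)


# Hypothetical p99 latency (ms) per minute and two control-plane events.
p99_ms = [(minute(14, 28), 200), (minute(14, 29), 210),
          (minute(14, 30), 205), (minute(14, 31), 900)]
trail = [
    {"EventTime": minute(13, 2), "EventName": "PutMetricAlarm", "Username": "ops-bot"},
    {"EventTime": minute(14, 30), "EventName": "UpdateFunctionConfiguration",
     "Username": "deploy-role"},
]

onset = find_inflection(p99_ms)
suspects = events_near(trail, onset)
print(onset.isoformat(), [e["EventName"] for e in suspects])
```

A ratio test is a crude step detector; it is enough to pick the minute to feed into `--start-time` and `--end-time`, which is all the workflow needs.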

[Figure: AWS complex incident response & RCA]

The diagram above ties it together: the lifecycle across the top, and beneath it the correlation plane where one UTC timeline links a CloudWatch metric inflection, a CloudTrail change event, an X-Ray service-map fault edge and the Logs Insights detail — the four views you triangulate to move from symptom to root cause.

Worked scenario 1 — Cascading failure from a throttled dependency

Symptom. The checkout API’s p99 latency climbs from 200 ms to 8 s over ten minutes, then it starts returning 5xx. Traffic is flat — no marketing spike. The on-call for checkout swears they deployed nothing.

Hypotheses.

  1. Checkout itself regressed (deploy, memory leak, GC). Unlikely — no deploy, and latency rose before errors.
  2. A downstream dependency slowed or started rejecting, and checkout’s threads/connections are now blocked waiting on it (back-pressure cascade).
  3. A shared resource (a database, a connection pool, a Lambda concurrency limit) is saturated and several services are contending for it.

Cross-service diagnosis. The golden-signal shape (latency up first, then errors, traffic flat) points away from load and towards a slow or rejecting dependency. Open the X-Ray service map for the checkout request: it shows checkout → order-service → a DynamoDB table, and the order-service → DynamoDB edge is red with high latency and a fault rate. Pivot to CloudWatch and overlay the table’s ThrottledRequests (or ReadThrottleEvents) metric: it went non-zero exactly when checkout’s latency began to climb. Now the question is: why is the table throttling on flat traffic? CloudTrail lookup-events for the preceding hour shows a deploy on a different service that began writing a new attribute with heavy UpdateItem traffic, and a Logs Insights query on order-service shows it is doing a Scan where it used to Query because a feature flag flipped. A Scan reads every item in the table rather than the items under one partition key, so it blows through the table’s provisioned (or hot-partition) capacity. DynamoDB throttles, order-service retries (amplifying load), its connection pool fills, and checkout, waiting on order-service, backs up and finally times out. Classic cascade: the symptom is in checkout, the trigger is a feature flag and a deploy elsewhere, and the bottleneck is the DynamoDB table’s read capacity.

Fix.

The lesson: retries and missing timeouts turn a small dependency hiccup into a system-wide outage. Backoff with jitter, timeouts, circuit breakers and bulkheads are the resilience primitives that contain a cascade.
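Two of those primitives can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production library: the class and function names are hypothetical, the breaker counts consecutive failures only, and the backoff uses the "full jitter" variant (a uniform delay between zero and the capped exponential).

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests need no sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency circuit is open; failing fast")
            # Cool-down elapsed: half-open. Allow one trial call, and make a
            # single further failure re-open the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delays: uniform in [0, min(cap, base * 2**n)] for each attempt n."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


print(backoff_delays(4))
```

Failing fast while the circuit is open is what stops the caller's threads and connections from piling up behind a dependency that is already struggling; the jitter keeps retrying clients from synchronising into waves.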

Worked scenario 2 — Availability Zone impairment and failover

Symptom. Error rate jumps to roughly one-third of requests and p99 latency spikes, but two-thirds of requests are perfectly healthy. The pattern is partial and persistent — refreshes sometimes succeed, sometimes fail.

Hypotheses.

  1. A bad deploy on a subset of instances (canary gone wrong).
  2. One AZ is impaired — networking, a backing service, or capacity in that zone.
  3. A single unhealthy backend (one RDS replica, one cache node) serving a fraction of traffic.

Cross-service diagnosis. “One-third failing, two-thirds fine” across a three-AZ deployment is the signature of single-AZ impairment. Confirm it three ways. First, the AWS Health Dashboard (Personal Health Dashboard) — check for an account-specific event naming an AZ ID (note: AZ names like us-east-1a are randomised per account; the Health event and your metrics should be reconciled by AZ ID, e.g. use1-az2). Second, break the load balancer’s metrics down by AZ: target group healthy-host count has dropped in one AZ and HTTPCode_Target_5XX_Count is concentrated there. Third, check the data tier — if RDS is Multi-AZ, is the primary in the impaired zone? Is one ElastiCache shard’s primary there? CloudWatch per-AZ metrics and the target-health view localise it. CloudTrail will be quiet — this is not a change you made, which itself is a strong signal that the cause is infrastructure, not config.
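The per-AZ breakdown can be sketched as a simple aggregation. The example assumes you have request records tagged with an AZ ID and an HTTP status (for instance from load balancer access logs); the sample data, the 50% threshold, and the function names are hypothetical.

```python
from collections import defaultdict


def error_rate_by_az(records):
    """records: iterable of (az_id, http_status). Returns {az_id: 5xx error rate}."""
    totals = defaultdict(lambda: [0, 0])  # az_id -> [requests, errors]
    for az_id, status in records:
        totals[az_id][0] += 1
        if status >= 500:
            totals[az_id][1] += 1
    return {az: errs / reqs for az, (reqs, errs) in totals.items()}


def impaired_azs(rates, threshold=0.5):
    """AZ IDs whose error rate is at or above the threshold."""
    return sorted(az for az, rate in rates.items() if rate >= threshold)


# Hypothetical traffic: roughly one-third of all requests fail overall,
# but the failures are concentrated in a single AZ ID.
records = (
    [("use1-az1", 200)] * 50
    + [("use1-az2", 503)] * 45 + [("use1-az2", 200)] * 5
    + [("use1-az4", 200)] * 50
)

rates = error_rate_by_az(records)
print(rates)
print(impaired_azs(rates))
```

The aggregate error rate here is 30%, which on its own looks like a partial bug; only the per-AZ split shows that one zone is at 90% and the other two are clean.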

Fix.

The lesson: design assuming an AZ will fail, then an AZ failure is a non-event. The incident is only severe if your architecture cannot tolerate losing one zone.

Worked scenario 3 — A service-quota breach under load

Symptom. During a traffic surge, a fraction of API requests fail with errors that mention Rate exceeded, LimitExceeded, or ThrottlingException — and the failures track traffic: more load, more failures, and they vanish when load drops. Nothing is “down”; the system is hitting an invisible ceiling.

Hypotheses.

  1. An application-level rate limit or a downstream API throttle.
  2. A Lambda concurrency or burst limit (account or function-level reserved concurrency).
  3. An account Service Quota — API request rate, ENIs per Region, concurrent executions, EIPs, etc. — being exceeded as load scales.

Cross-service diagnosis. The “fails in proportion to load, recovers when load drops” shape is the fingerprint of a quota or limit, not a bug. Identify which ceiling. For Lambda, CloudWatch shows Throttles rising with invocations while ConcurrentExecutions flatlines at a round number (1,000 by default, or your reserved figure) — that is the concurrency limit. For API throttling, the error envelope names the service; check the Service Quotas console for that service’s “applied quota value” and compare against your CloudWatch usage metrics (many quotas now publish a usage metric you can alarm on). CloudTrail can show whether someone recently lowered a reserved concurrency or whether a new function ate the account pool. The tell that distinguishes this from scenario 1: here the errors are immediate throttles that scale with traffic, not latency-then-timeout from a slow dependency.
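The Lambda "flatline at a round number" check can be expressed as a small predicate. This sketch assumes you have pulled recent `ConcurrentExecutions` and `Throttles` datapoints into two lists; the function name, the `tail` window, and the sample values are hypothetical.

```python
def hit_concurrency_ceiling(concurrent, throttles, limit, tail=3):
    """True when the last `tail` datapoints sit at the limit while throttles are non-zero."""
    recent_c = concurrent[-tail:]
    recent_t = throttles[-tail:]
    return (
        len(recent_c) == tail
        and len(recent_t) == tail
        and all(c >= limit for c in recent_c)
        and all(t > 0 for t in recent_t)
    )


# Hypothetical per-minute datapoints during a traffic surge.
concurrent_executions = [400, 800, 1000, 1000, 1000]
throttle_counts = [0, 0, 12, 30, 41]

print(hit_concurrency_ceiling(concurrent_executions, throttle_counts, limit=1000))
```

Requiring both conditions avoids a false positive when concurrency briefly touches the limit without any request actually being throttled.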

Fix.

The lesson: quotas are a capacity-planning problem, not an incident. If you discover a quota during an outage, you have a monitoring gap, not just a limit.

Worked scenario 4 — An IAM/SCP change with a wide blast radius

Symptom. At a precise timestamp, multiple unrelated services across one or several accounts start failing with AccessDenied — uploads to S3, a Lambda that can no longer write to DynamoDB, an ECS task that cannot pull from ECR. No application deploy happened. The breadth is the clue: many services, one moment, same error class.

Hypotheses.

  1. A KMS key policy or grant was changed and everything that decrypts through it now fails (a wide but specific blast radius).
  2. An IAM change — a shared role’s permissions, or a permission boundary tightened.
  3. A Service Control Policy (SCP) at the Organization/OU level changed and is now denying actions org-wide regardless of IAM (SCPs set the maximum permissions; an explicit deny there wins everywhere).

Cross-service diagnosis. When AccessDenied appears across many services at the same minute, suspect a policy layer above the application, because identity-based policies are usually edited per service. Go straight to CloudTrail and look up PutBucketPolicy, PutKeyPolicy, PutRolePolicy, DeleteRolePolicy, PutRolePermissionsBoundary (or PutUserPermissionsBoundary), and crucially the Organizations events UpdatePolicy/AttachPolicy (SCPs are recorded in the management/delegated-admin account’s CloudTrail). The timestamp of the change will line up exactly with the onset. To prove which layer is denying, take one failing call and reason through the evaluation chain: an SCP deny blocks the action no matter what IAM allows, so if the IAM policy looks correct but the call still fails org-wide, the SCP (or a permission boundary, or a resource-policy/KMS-key-policy deny) is the culprit. CloudTrail’s errorCode: AccessDenied entries, combined with knowing that explicit deny always wins, let you walk from symptom to the exact policy edit. IAM Access Analyzer can help confirm what a principal can actually do after the change.

Fix.

The lesson: a one-line policy edit can have the widest blast radius in AWS. The layered evaluation model (SCP → resource policy → permission boundary → identity policy, with explicit deny trumping all) is exactly what you walk during diagnosis — and exactly why these changes belong in a reviewed pipeline.
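The "explicit deny wins, and every layer must allow" reasoning can be sketched as a toy evaluator. This is a deliberately simplified model for illustration only: it covers SCP, permission boundary and identity policy as plain sets of action names, and it omits resource policies, session policies, wildcards and conditions, whose real behaviour is more involved. All names and policies below are hypothetical.

```python
def evaluate(action, layers):
    """layers: list of (layer_name, allowed_actions, denied_actions).

    Returns (decision, reason). Any explicit deny wins; otherwise every
    layer in the list must allow the action.
    """
    for name, _allow, deny in layers:
        if action in deny:
            return "Deny", f"explicit deny in {name}"
    for name, allow, _deny in layers:
        if action not in allow:
            return "Deny", f"implicit deny: {name} does not allow {action}"
    return "Allow", "every layer allows it"


actions = {"s3:PutObject", "dynamodb:PutItem"}
identity = ("identity policy", actions, set())
boundary = ("permission boundary", actions, set())
scp_before = ("SCP", actions, set())
scp_after = ("SCP", actions, {"s3:PutObject"})  # the hypothetical one-line edit

print(evaluate("s3:PutObject", [scp_before, boundary, identity]))
print(evaluate("s3:PutObject", [scp_after, boundary, identity]))
```

Note that the identity policy is identical in both calls: that is the situation described in the diagnosis, where the IAM policy "looks correct" and the denial comes from a layer above it.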

Worked scenario 5 — A Route 53 / DNS or certificate failure

Symptom. Users report “the site won’t load” or “your connection is not private,” but your load balancer and application metrics look completely healthy — low latency, no 5xx, normal CPU. The failure is happening before traffic reaches your infrastructure, which is why your dashboards are clean.

Hypotheses.

  1. DNS resolution is failing or returning the wrong answer (a record was changed/deleted, a failover/health-check flipped, an NS delegation issue).
  2. A TLS certificate expired or the wrong certificate is being served (ACM cert not renewed because validation lapsed; cert/domain mismatch).
  3. CDN/edge or WAF is blocking or misrouting at the edge.

Cross-service diagnosis. Healthy backend metrics with “can’t load” reports is the signature of an edge/DNS/cert problem. Split the two possibilities at the command line. For DNS, resolve the name and inspect the answer end to end:

dig +trace example.com           # follows the delegation chain NS by NS
dig example.com A @8.8.8.8       # what a public resolver actually returns

If dig returns the wrong IP, an empty answer, or SERVFAIL, the problem is DNS. In Route 53, check whether a health check flipped a failover record (a health check that fails falsely will route traffic away from an endpoint that is actually healthy), whether a record was edited (CloudTrail ChangeResourceRecordSets shows the change and the actor), and whether the hosted zone’s NS records still match the registrar’s delegation. For TLS, inspect the served certificate’s expiry and subject:

echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

An expired notAfter, or a subject/SAN that does not cover the host, is your cause. With ACM, expiry almost always traces back to DNS validation breaking (the CNAME validation record was removed, so ACM could not auto-renew) — and CloudTrail/ACM events will show the renewal failures.
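To turn the `notAfter` line from the openssl output into a number you can alarm on, a small parser is enough. The sketch below uses the standard library's `ssl.cert_time_to_seconds`; the helper names and the sample date are hypothetical, and `now` is passed in explicitly so the result is reproducible.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(line: str) -> str:
    """Extract the date from a line such as 'notAfter=Jun 15 12:00:00 2026 GMT'."""
    key, sep, value = line.strip().partition("=")
    if key != "notAfter" or not sep:
        raise ValueError(f"not a notAfter line: {line!r}")
    return value


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` (timezone-aware) until the certificate expires; negative if expired."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


# Hypothetical openssl output line and a fixed "current" time in UTC.
line = "notAfter=Jun 15 12:00:00 2026 GMT"
now = datetime(2026, 6, 10, 12, 0, tzinfo=timezone.utc)
print(days_until_expiry(parse_not_after(line), now))
```

A negative result means the certificate has already expired; a small positive one is the early warning you want before users see a browser error.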

Fix.

The lesson: when your metrics are healthy but users cannot reach you, look outward — DNS, certificate, edge — because the failure is upstream of everything your CloudWatch sees.

Service Quotas, Trusted Advisor and the Health Dashboard as prevention

Three of the five scenarios above (and most real incidents) are preventable with hygiene that lives outside the application:

The pattern across all three: convert future incidents into present-day alarms. A quota you alarm on at 80% is a Tuesday-afternoon ticket; the same quota discovered at 100% during a launch is a Sev-1.
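The 80% rule can be sketched as a headroom check over usage and quota values. This assumes you have already collected current usage and applied quota values into two dictionaries; the quota names, the numbers, and the function name are hypothetical.

```python
def quota_alerts(usage, quotas, threshold=0.8):
    """Return (name, utilisation) pairs at or above `threshold`, highest first."""
    alerts = []
    for name, used in usage.items():
        limit = quotas.get(name)
        if not limit:
            continue  # unknown or zero quota: nothing meaningful to compare against
        ratio = used / limit
        if ratio >= threshold:
            alerts.append((name, round(ratio, 2)))
    return sorted(alerts, key=lambda pair: pair[1], reverse=True)


# Hypothetical snapshot of usage against applied quota values.
usage = {"lambda-concurrent-executions": 850, "vpc-enis": 120, "elastic-ips": 5}
quotas = {"lambda-concurrent-executions": 1000, "vpc-enis": 5000, "elastic-ips": 5}

print(quota_alerts(usage, quotas))
```

Sorting by utilisation puts the quota closest to its ceiling at the top, which is the one to raise a ticket for first.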

Hands-on lab — correlate a self-inflicted incident (Free Tier)

You will create a tiny, safe incident and practise the correlation workflow end to end, then clean up. Everything here is Free-Tier-eligible if you tear it down promptly.

1. Set up a function and a log group. Create a minimal Lambda (any runtime) named coe-lab that logs a structured line and occasionally “fails”:

aws lambda create-function --function-name coe-lab \
  --runtime python3.13 --handler index.handler --timeout 5 \
  --role arn:aws:iam::111122223333:role/your-lambda-basic-role \
  --zip-file fileb://function.zip

(function.zip contains an index.py whose handler prints {"level":"INFO","reqId":context.aws_request_id,...} and raises an exception when the event has {"fail": true}.)
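A minimal index.py matching that description might look like the sketch below. The message text, the extra ERROR line before the exception, and the return value are assumptions added for the lab; only the `level` and `reqId` fields and the fail-on-flag behaviour come from the description above.

```python
import json


def handler(event, context):
    """Log one structured line per invocation; fail when the event asks for it."""
    record = {"level": "INFO", "reqId": context.aws_request_id, "msg": "invoked"}
    if event.get("fail"):
        record["level"] = "ERROR"
        record["msg"] = "simulated failure"
        # Compact separators keep the output as "level":"ERROR" with no spaces.
        print(json.dumps(record, separators=(",", ":")))
        raise RuntimeError("simulated failure requested by event")
    print(json.dumps(record, separators=(",", ":")))
    return {"ok": True, "reqId": context.aws_request_id}
```

Logging the request ID on every line is what makes the later metric-to-logs pivot possible: you can filter the log group down to a single invocation.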

2. Generate a signal. Invoke it a few times, including failures, to put both success and error lines into CloudWatch Logs:

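# AWS CLI v2 reads --payload as base64 by default; --cli-binary-format lets you pass raw JSON
for i in $(seq 1 20); do
  aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out --payload '{"fail":false}' /dev/null >/dev/null
done
aws lambda invoke --function-name coe-lab --cli-binary-format raw-in-base64-out --payload '{"fail":true}' /dev/null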

3. Triage with metrics. In the CloudWatch console, open the Errors, Invocations and Throttles metrics for coe-lab (set the timezone to UTC). Note the minute the error appears — that is your inflection point.

4. Drill in with Logs Insights. Run a query against the function’s log group:

fields @timestamp, @message
| filter @message like /ERROR|Traceback|"level":"ERROR"/
| stats count(*) as errors by bin(1m)
| sort @timestamp desc

Confirm the error count and timing match the metric. This is the metric→logs pivot you will do in every real incident.

5. Tie it to a change with CloudTrail. Update the function’s configuration (a harmless change) to create a control-plane event:

aws lambda update-function-configuration --function-name coe-lab --timeout 6
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=coe-lab \
  --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' --output table

You should see your UpdateFunctionConfiguration event with its timestamp and actor — exactly how you would prove “someone changed this at 14:31”.

Validation. You have, on one UTC timeline, (a) a metric inflection, (b) the matching log lines, and (c) the control-plane change that explains a config-driven incident. That is the core RCA loop in miniature.

Cleanup.

aws lambda delete-function --function-name coe-lab
aws logs delete-log-group --log-group-name /aws/lambda/coe-lab

Cost note. A handful of Lambda invocations and a few Logs Insights queries fall within the Free Tier; the log group stores a trivial amount. CloudTrail management events are free to view via lookup-events. Deleting the function and log group leaves nothing billable. If you ever enable CloudTrail data events or create a CloudWatch alarm at scale, those can incur small charges — not needed for this lab.

Common mistakes & troubleshooting

| Symptom of your process | Cause | Fix |
| --- | --- | --- |
| Investigation goes in circles; the “fix” doesn’t hold | Comparing signals in different time zones | Put every console and query in UTC; align all tools to one clock |
| You debug for 40 minutes while customers suffer | Trying to find root cause before mitigating | Mitigate first (roll back / fail over / raise quota), diagnose after |
| Three people, three theories, no progress | No Incident Commander | Name an IC who coordinates and decides and does not debug |
| You “fixed” the trigger but it recurs | Treated the trigger as the cause | Five-whys to the systemic cause (missing timeout, no alarm, no review) |
| The one trace you need isn’t in X-Ray | Default sampling dropped it | Raise sampling for the affected service during the incident; rely on metrics+logs meanwhile |
| CloudTrail “shows nothing” around the event | Looking at the wrong account or only management events | Check the management/delegated-admin account for SCP/Org events; widen the window for delivery latency |
| You blame the failing service | Symptom and cause are in different services | Use the service map and golden-signal shapes to find the upstream cause |
| Postmortem names a person | Blameful culture | Run a blameless COE: fix the system that let a human error become an outage |

Best practices

Security notes

Interview & exam questions

1. Walk me through how you run a multi-service incident. Detect → triage (scope, severity, name an IC and comms lead) → communicate on a cadence → mitigate before fully diagnosing → RCA by correlating metrics/logs/traces/CloudTrail on one UTC timeline → prevent via a blameless COE with tracked CAPA items.

2. The symptom is in service A but A didn’t change — how do you find the cause? Use the X-Ray service map to find the failing edge, read the four golden signals for A and its dependencies to classify the failure shape, then pivot to the dependency’s metrics/logs and to CloudTrail for any change at the inflection minute. The cause is usually a downstream throttle, a saturated shared resource, or a config change elsewhere.

3. Errors rise in proportion to traffic and vanish when load drops — what is it, usually? A quota or limit (Lambda concurrency, API request rate, ENIs, etc.), not a bug. Identify it via the throttle metrics and the Service Quotas console; mitigate with a quota increase / load shedding / queueing; prevent by alarming at 80% of quota.

4. One-third of requests fail, two-thirds are fine, persistently — what’s your first hypothesis? Single-AZ impairment on a three-AZ deployment. Confirm with the Personal Health Dashboard, per-AZ load-balancer/target metrics (by AZ ID, since AZ names are randomised per account), and the data tier (is an RDS/cache primary in that AZ). Fail traffic away and rely on cross-AZ headroom.

5. Difference between a mitigation and a fix — give an example. A mitigation stops customer pain now (roll back the deploy, fail over the AZ, raise the quota); a fix removes the cause (correct the access pattern, fix the policy, restore ACM validation). You mitigate first, then fix; both belong in the COE.

6. Multiple services across accounts throw AccessDenied at the same minute — what changed? Almost certainly a policy layer above the app: an SCP (in the management/delegated-admin account), a KMS key policy, or a shared role/permission boundary. CloudTrail (PutKeyPolicy, Organizations AttachPolicy/UpdatePolicy, PutRolePolicy) shows the edit; explicit deny wins, so revert the change to mitigate.

7. Your dashboards are green but users say the site won’t load — where do you look? Outward — DNS and certificates and edge. dig +trace to validate resolution and openssl s_client to check the served cert’s expiry/subject. Common cause: ACM auto-renewal failed because the DNS validation CNAME was removed; or a Route 53 record/health-check flipped.

8. What’s a cascading failure and how do you prevent it? A small dependency hiccup amplified by retries and missing timeouts until threads/connections exhaust and the failure spreads upstream. Prevent with timeouts, exponential backoff with jitter, circuit breakers and bulkheads, plus alarms on dependency throttles and saturation.

9. How do you correlate a metric spike to a specific change? Note the exact UTC minute of the inflection, then aws cloudtrail lookup-events for that window. A control-plane event in the same minute (a deploy, a quota edit, a policy change) is your trigger.

10. Why blameless postmortems? Because blame drives information underground — people stop sharing what really happened, and you fix symptoms instead of the systemic gaps (missing alarm, no timeout, click-ops policy edit) that let a human error become an outage. The COE asks “what about the system allowed this?”, not “who did it?”

11. X-Ray is sampled — what if the failing trace wasn’t captured? Temporarily raise the sampling rate for the affected service, and in the meantime lean on CloudWatch metrics (which aggregate every request and are not subject to trace sampling) and structured logs keyed by request ID.

12. How do you make quota breaches a non-event? Treat quotas as capacity planning: enable Service Quotas usage metrics and Trusted Advisor limit checks, alarm at ~80% of every quota that scales with traffic, and request increases ahead of known peaks.

Quick check

  1. What must you align across CloudWatch, CloudTrail, X-Ray and logs before you can correlate them?
  2. In the incident lifecycle, what do you do before you fully understand the cause?
  3. “Errors rise with traffic, recover when traffic drops” — what class of problem is this?
  4. Which AWS construct, changed in one account, can deny actions across an entire Organization regardless of IAM?
  5. Your application metrics are healthy but users can’t reach the site — name two things to check.

Answers

  1. A single UTC timeline (and ideally a shared request/trace ID) — same clock and window across every tool.
  2. Mitigate — roll back, fail over, raise a quota, shed load — to stop customer impact; diagnose afterwards from the telemetry.
  3. A service-quota / limit breach (e.g. Lambda concurrency, API throttling), not an application bug.
  4. A Service Control Policy (SCP) — an explicit deny in an SCP overrides any IAM allow across the OU/Organization.
  5. DNS (dig +trace — wrong/empty answer, failover/health-check flip, deleted record) and the TLS certificate (openssl s_client — expired or wrong subject, usually ACM DNS-validation lapse). Edge/WAF is a valid third.

Exercise

Take a real or representative architecture you know (say: CloudFront → ALB → ECS → RDS Multi-AZ, with a Lambda doing async work off SQS). For each of the five scenario classes in this lesson, write a one-page runbook entry: the symptom as it would actually page you, the first three hypotheses in priority order, the exact CloudWatch metric, Logs Insights query and CloudTrail lookup you would run to confirm, the mitigation (and how to execute it for your stack), the fix, and the one alarm or guardrail that would turn this incident into a warning. Then pick the scenario you are least prepared for and actually create that alarm in a sandbox account. Bonus: run a 30-minute game day with a colleague playing IC while you inject one of the failures (e.g. lower a Lambda’s reserved concurrency to 1 under load) and practise the lifecycle for real.

Certification mapping

Glossary

Next steps

You now have the operational mindset for incidents that span services. Next, turn that operational knowledge into design: The AWS Architecting Ladder: From a Static Site to Multi-Region Active-Active (aws-architecting-ladder-static-site-to-multi-region) shows how the resilience primitives you just used in mitigation — Multi-AZ, failover, headroom, bulkheads — are baked into architectures from the ground up, so the incidents in this lesson become non-events. If you have not yet, revisit the single-service playbooks in AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda — they are the per-layer detail that the cross-service diagnosis here builds upon.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
