
AWS Troubleshooting Playbooks: EC2, VPC, IAM, S3 & Lambda

There is a particular kind of panic that hits when something in AWS breaks in production. An EC2 instance you could reach yesterday refuses SSH; a Lambda function that worked in testing starts timing out under load; a perfectly valid S3 request comes back 403 AccessDenied; an IAM role that definitely has the right policy is told it can’t call an API. The temptation is to start clicking — open the security group to 0.0.0.0/0, bump the Lambda timeout to the maximum, attach AdministratorAccess to the role, make the bucket public — and hope. That is gambling, not troubleshooting, and it is how a five-minute incident becomes a two-hour one with three new security holes layered on top of the original fault.

This lesson teaches the opposite habit: a repeatable method that turns “it’s broken and I don’t know why” into a short, ordered set of questions that converge on the real cause — plus per-service playbooks (EC2, VPC, IAM, S3, Lambda) mapping a symptom to its likely cause, the one diagnostic that confirms it, and the fix. A senior engineer is not someone who has memorised every error code; it is someone who can take an unfamiliar failure and narrow it down calmly. By the end you will have that instinct, plus a reference to keep open during an incident.

This is a methodology lesson: it teaches how to think, using just enough of each tool to act. The next lesson, Advanced AWS Troubleshooting: Complex Multi-Service Incidents & Root-Cause Analysis, takes the same method up a level to incidents that span several services at once, with CloudWatch Logs Insights, X-Ray and the AWS Health Dashboard. Everything here maps to SOA-C02 (AWS Certified SysOps Administrator – Associate), where troubleshooting scenarios appear across the exam domains.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should already understand AWS’s core building blocks: the account and Region/AZ model, IAM users, roles and policies, VPCs with subnets, route tables and security groups, EC2 instances, S3 buckets and Lambda functions (all covered in earlier lessons). You needn’t be an expert in any; troubleshooting is precisely the skill of reasoning about a system you only partly understand. This is Lesson A3, the first of two in the Troubleshooting & Operations module of the AWS Zero-to-Hero course. It builds directly on AWS IAM Fundamentals (the policy-evaluation logic the IAM playbook leans on) and AWS VPC Networking Fundamentals (the routing and security-group model the VPC playbook leans on), and it leads into the complex-incident lesson. In SOA-C02 terms, the playbooks here line up most closely with the “Monitoring, Logging, and Remediation” domain.

The troubleshooting mindset: eight steps that always work

Tools change; the method does not. Whether you are debugging a 2003-era on-prem server or a 2026 AWS landing zone, the same loop applies. Internalise these eight steps and you will never again be the person frantically toggling settings.

| # | Step | The question it answers | Why it matters |
|---|------|-------------------------|----------------|
| 1 | Reproduce | Can I make it fail on demand? | A fault you can’t reproduce, you can’t confirm you’ve fixed. Pin down exactly who, what, where, and when. |
| 2 | Isolate the layer | Which layer is actually failing: identity, network, the resource, or the app? | This is the master move. Most wasted time comes from fixing the wrong layer. |
| 3 | Config vs desired | Does the current configuration match what I intended? | Most AWS incidents are config drift or a recent change, not a platform fault. |
| 4 | Inspect CloudWatch & CloudTrail | What does the evidence say? | Metrics, logs and the API audit trail are ground truth. Read them before theorising, not after. |
| 5 | Hypothesise & test | What’s my single best guess, and what one test confirms or kills it? | One variable at a time. A test that can only “succeed” proves nothing. |
| 6 | Fix | What is the smallest change that addresses the root cause? | Fix the cause, not the symptom; change one thing so you know what worked. |
| 7 | Verify | Is it actually fixed, from the user’s perspective? | Re-run the reproduction from step 1. “It looks fine in the console” is not verification. |
| 8 | Prevent | How do I make sure this never silently recurs? | Turn the fix into an alarm, a guardrail (SCP/Config rule), a runbook, or a test. This is what makes you senior. |

A few principles make the loop sharper:

Isolating the layer — the master skill

Step 2 deserves its own model because it is where time is won or lost. Almost every AWS failure lives in one of five layers. Ask the questions top to bottom and you will usually localise the fault in under a minute:

| Layer | “Is the problem here?” quick test | Typical symptoms |
|-------|-----------------------------------|------------------|
| Identity / authorization | Can the caller authenticate, and does its policy allow this action on this resource? | AccessDenied, UnauthorizedOperation, 403 on a control-plane call, “not authorized to perform” |
| Network / connectivity | Can the packet physically reach the target on the right port? | Timeouts, “connection refused”, can’t SSH/RDP, intermittent drops |
| DNS / name resolution | Does the name resolve to the IP (or endpoint) you expect? | “could not resolve host”, connecting to a public IP of a private resource, gateway/VPC-endpoint misses |
| The resource / service | Is the resource itself healthy, running and within its limits? | Status-check failure, throttling, 503 from a stopped backend, service-quota errors |
| The application | Is the code/config inside the resource the problem? | App-level 500, stack traces, bad connection string, unhandled exception in a Lambda |

The trick is to test the cheapest, most likely layer first and to bisect: if a request fails from one machine but succeeds from another, the fault lies in whatever differs between the two. If the same request fails everywhere, the problem is central (the resource or its config), not the caller.
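As one cheap example of a DNS-layer quick test, the sketch below classifies an IPv4 address that a name resolved to as private (RFC 1918) or not. The function name `ip_scope` is hypothetical, and the check deliberately ignores IPv6 and other special ranges; treat it as an illustration of "does the name resolve to what I expect?", not a complete classifier.

```shell
# Classify an IPv4 address as RFC 1918 private or not.
# Feed it the address a lookup returned, e.g. from: nslookup <name>
ip_scope() {   # hypothetical helper name
  case $1 in
    10.*|192.168.*) echo private ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo private ;;
    *) echo public-or-other ;;
  esac
}

ip_scope 10.0.1.25     # an address inside a typical VPC CIDR
ip_scope 54.12.34.56   # an address outside the private ranges
```

If a name that should point at a private resource comes back as public-or-other, the DNS layer is the place to look before touching security groups or routes.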

[Figure: AWS troubleshooting decision tree]

The decision tree above is the same logic rendered as a flowchart: start from the symptom, ask “can it authenticate and is it authorised?”, then “can the packet arrive?”, then “does the name resolve?”, then “is the resource healthy and within quota?”, and finally “is it the app?” — branching to the matching playbook below at each point.

EC2 playbook: “I can’t connect to my instance”

Failing to reach an EC2 instance is one of the most common AWS problems, and almost all cases come down to a handful of causes: a security group not allowing the port, a network ACL denying it, the instance in a subnet with no route to the internet, a missing or wrong key pair, or the instance failing its status checks. Beginners stare at the operating system; the fix is to inspect the network path and the instance state AWS actually applied, from the outside, before touching the box.

A few facts shortcut most cases. Security groups are stateful (if you allow inbound on a port, the return traffic is automatically allowed) and have allow rules only. Network ACLs are stateless (you must allow both the inbound request and the outbound ephemeral-port response, e.g. 1024–65535) and can have explicit deny rules. The instance status check (is the operating system up and reachable on the network?) and the system status check (is the underlying host healthy?) are two different signals; read both before you reboot anything. EC2 Instance Connect, the serial console, and EC2 Instance Connect Endpoint (SSH to a private instance with no public IP and no port open to the internet) let you in when normal SSH is blocked.
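To make the "read both status checks" habit concrete, here is a minimal sketch that maps the two results to a first action. The function name `status_check_action` is hypothetical; the status strings are the ones EC2 reports, and the command in the comment is one read-only way to fetch them.

```shell
# Map the two EC2 status checks to a sensible first action.
# Read the inputs (read-only) with something like:
#   aws ec2 describe-instance-status --instance-ids "$IID" --include-all-instances \
#     --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' --output text
status_check_action() {   # hypothetical helper: status_check_action <system> <instance>
  local system=$1 instance=$2
  if [ "$system" = impaired ]; then
    echo "system check failed: stop/start to move to a healthy host"
  elif [ "$instance" = impaired ]; then
    echo "instance check failed: reboot or repair the OS via the serial console"
  elif [ "$system" = ok ] && [ "$instance" = ok ]; then
    echo "both checks pass: look at the network path or the application"
  else
    echo "checks not conclusive yet: wait and re-read"
  fi
}

status_check_action ok impaired
```

The system check is tested first on purpose: an impaired host usually explains an impaired instance check as well, so fixing the OS first would be wasted effort.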

| Symptom | Likely cause | Diagnostic step | Fix |
|---------|--------------|-----------------|-----|
| SSH/RDP times out (no response) | Security group not allowing 22/3389 from your IP, or NACL denying it | Check the security group inbound rules; check the subnet NACL (both directions); confirm a public IP/route | Add an inbound Allow for the port scoped to your IP (not 0.0.0.0/0); fix the NACL; ensure a public IP + IGW route |
| Connection refused (fast reject) | sshd/RDP not running, wrong port, or host firewall (iptables/Windows Firewall) inside the OS | Use EC2 Serial Console or Instance Connect; check the service is listening | Start/repair the service; bind the right port; open the in-guest firewall |
| “Permission denied (publickey)” | Wrong key pair, wrong username, or bad ~/.ssh/authorized_keys permissions | Confirm the key matches the instance’s key pair; use the right user (ec2-user, ubuntu, admin) | Use the correct .pem; set chmod 600 on the key; repair authorized_keys via serial console |
| Reachable from one place, not another | The instance has no public IP / is in a private subnet; you need a bastion or endpoint | Check the instance’s public/private IP and the subnet’s route table | Use a bastion, Session Manager, or EC2 Instance Connect Endpoint for private instances |
| Instance status check failed (1/2 or 2/2) | OS-level problem (instance check) or impaired host (system check) | EC2 console → Status checks; read instance vs system result | Instance check: reboot / fix the OS via serial console. System check: stop/start to move to a healthy host |
| Instance stuck in stopping/pending or won’t start | Underlying capacity issue, or InsufficientInstanceCapacity in the AZ | CloudTrail + EC2 events; try another AZ/instance type | Stop/start to relocate; launch in a different AZ/type; consider a capacity reservation |
| Connect by DNS name fails, IP works | VPC DNS resolution/hostnames disabled, or stale record | Check VPC enableDnsSupport/enableDnsHostnames; nslookup the name | Enable DNS support/hostnames on the VPC; fix the Route 53 record |
| Low CPU credit balance / sudden slowdown (T-family) | CPU credit balance exhausted on a burstable instance | CloudWatch CPUCreditBalance / CPUSurplusCreditBalance | Enable unlimited mode, move to a larger/non-burstable type, or right-size |

A grounding example: an instance is unreachable on port 22. The security group shows an inbound Allow 22 from your IP, so the inbound request is permitted at the instance. But the subnet’s NACL has an outbound DENY covering the ephemeral port range, so the response never returns. You found it by remembering NACLs are stateless, in two read-only checks, without ever rebooting the box.
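The first two rows of the table hinge on telling a silent timeout from a fast refusal. A minimal sketch of that probe follows, assuming a Linux shell with `bash` and the coreutils `timeout` command (CloudShell has both); the function name `probe_port` is hypothetical. Pass an IP address rather than a name so a DNS failure is not mistaken for a refusal.

```shell
# Distinguish "timeout" (packets dropped: suspect SG, NACL or route)
# from "refused" (host reached: suspect the service, port or in-guest firewall).
probe_port() {   # hypothetical helper: probe_port <ip> <port> [seconds]
  local host=$1 port=$2 secs=${3:-5}
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout kills it if it hangs.
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  case $? in
    0)   echo open ;;
    124) echo timeout ;;            # exit status 124 means timeout had to kill the attempt
    *)   echo refused-or-error ;;   # rejected quickly, or another immediate failure
  esac
}

probe_port 127.0.0.1 1 2   # port 1 on localhost is normally closed
```

A timeout points outward, at security groups, NACLs and routes; a refusal means the packet arrived and something on the host turned it away.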

VPC playbook: “there’s no connectivity”

VPC connectivity issues are about the path a packet takes. The usual suspects are a route table missing the route, the internet gateway (IGW) or NAT gateway absent or misconfigured, a security group or network ACL blocking traffic, a peering/Transit Gateway route missing on one side, or DNS resolving the wrong endpoint. The decisive tool is the VPC Reachability Analyzer, which computes — without sending a packet — whether traffic from a source ENI can reach a destination, and names the blocking component (the security group, NACL, route, or gateway) when it cannot.

The mental checklist for “private subnet can’t reach the internet” is: public subnets route 0.0.0.0/0 to an IGW; private subnets route 0.0.0.0/0 to a NAT gateway that itself sits in a public subnet. For “can’t reach S3/DynamoDB privately”, you want a gateway VPC endpoint with a route; for other services, an interface endpoint (PrivateLink) with the right security group and private DNS.
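The checklist above can be sketched as a small classifier over a subnet's default-route targets. The function name `default_route_kind` is hypothetical. The query in the comment only finds route tables explicitly associated with the subnet; a subnet that falls back to the VPC's main route table returns nothing and needs a separate lookup.

```shell
# Classify a subnet by where its 0.0.0.0/0 route points.
# Read the targets (read-only) with something like:
#   aws ec2 describe-route-tables --filters Name=association.subnet-id,Values="$SUBNET" \
#     --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'].[GatewayId,NatGatewayId]" \
#     --output text
default_route_kind() {   # hypothetical helper: pass the target IDs the query printed
  local target
  for target in "$@"; do
    case $target in
      igw-*) echo "public: default route to internet gateway $target"; return ;;
      nat-*) echo "private with NAT: default route to NAT gateway $target"; return ;;
    esac
  done
  echo "no IGW or NAT default route found"
}

default_route_kind None nat-0abc1234   # example values; "None" is how the CLI prints a null
```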

| Symptom | Likely cause | Diagnostic step | Fix |
|---------|--------------|-----------------|-----|
| Public instance has no internet | Missing 0.0.0.0/0 → IGW route, or no public IP | Check the subnet route table for an IGW route; confirm an Elastic/public IP | Add 0.0.0.0/0 → igw-…; assign a public IP / Elastic IP |
| Private instance has no outbound internet | No 0.0.0.0/0 → NAT gateway route, or the NAT is in a private subnet | Route table for the NAT route; confirm the NAT sits in a public subnet with an IGW route | Add 0.0.0.0/0 → nat-…; place the NAT in a public subnet with an EIP |
| Traffic silently dropped between instances | Security group or NACL blocking; SG doesn’t reference the peer SG | Reachability Analyzer source→dest; check SG/NACL | Allow the port (reference the source SG as the source); fix the NACL both directions |
| Can’t reach a peered VPC | Missing route to the peer CIDR on one side, or overlapping CIDRs | Check route tables on both VPCs; verify non-overlapping CIDRs | Add the peer-CIDR route on both sides; peering can’t route overlapping ranges |
| Can’t reach S3/DynamoDB from a private subnet | No gateway endpoint or its route missing | Check for a gateway VPC endpoint and the prefix-list route | Create the gateway endpoint and associate it with the subnet’s route table; it adds the managed prefix-list route automatically |
| Interface (PrivateLink) endpoint unreachable | Endpoint security group blocks 443, or private DNS off | Check the endpoint SG; resolve the service FQDN | Allow inbound 443 on the endpoint SG; enable private DNS on the endpoint |
| Resolves to a public IP of a private resource | Private DNS/Route 53 private hosted zone not associated | nslookup the FQDN (expect a private IP) | Associate the private hosted zone with the VPC; enable DNS hostnames |
| Intermittent outbound failures under load | NAT gateway SNAT port exhaustion, or NAT/EIP throughput limit | CloudWatch NAT ErrorPortAllocation/ActiveConnectionCount | Add NAT gateways (per-AZ), reuse connections, or spread traffic across more destination IPs |

A grounding example: a private EC2 instance can’t reach the internet. Reachability Analyzer reports not reachable, blocked at the route table — there’s no 0.0.0.0/0 to the NAT. You fixed the route, not the security group, because the evidence named the exact component.

IAM playbook: “Access Denied”

IAM failures feel mysterious until you remember the policy-evaluation logic, which is deterministic. By default everything is implicitly denied; an explicit Allow in any applicable policy grants access; and an explicit Deny in any policy overrides every allow. Multiple policy types apply at once: identity-based policies (on the user/role), resource-based policies (on the bucket, queue, function, KMS key), permission boundaries (a ceiling on what an identity can be granted), Service Control Policies (SCPs) (an Organizations-wide ceiling), and session policies. An action is allowed only if it survives every one of them.
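The precedence rule in the paragraph above can be sketched in a few lines. This is a deliberately simplified model, not the full evaluation: it takes the effects of the statements that matched a request and applies "explicit deny beats allow beats implicit deny". It does not model how boundaries, SCPs and session policies cap what an Allow can grant. The function name `iam_decision` is hypothetical.

```shell
# Simplified model of IAM precedence over the statements that matched a request.
# Arguments: the Effect of each matching statement (Allow or Deny), in any order.
iam_decision() {   # hypothetical helper name
  local effect allowed=no
  for effect in "$@"; do
    case $effect in
      Deny)  echo "explicit deny"; return 0 ;;   # one Deny anywhere ends the evaluation
      Allow) allowed=yes ;;
    esac
  done
  if [ "$allowed" = yes ]; then echo "allow"; else echo "implicit deny"; fi
}

iam_decision Allow Allow Deny   # an SCP Deny alongside two Allows
iam_decision Allow              # a single matching Allow
iam_decision                    # nothing matched
```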

The decisive read-only tools are the IAM Policy Simulator (does this principal’s effective policy allow this action on this resource?), the error message itself (User: arn:… is not authorized to perform: <action> on resource: <arn> tells you the exact action and ARN to grant), and CloudTrail, where a denied call records errorCode: AccessDenied along with the principal, the action, the resource, and often the reason in errorMessage (e.g. “with an explicit deny in a service control policy”).
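Since the error message already names the action and the ARN, a small parser can pull them out for the Policy Simulator or a least-privilege grant. The sketch below assumes the message shape quoted above; real messages vary by service (some quote the ARN or omit the resource), so treat the pattern as a starting point. The function name, account ID, user and bucket are all made-up example values.

```shell
# Extract the denied action and resource ARN from an IAM "not authorized" message.
parse_denied() {   # hypothetical helper: parse_denied "<error message>"
  printf '%s\n' "$1" |
    sed -n 's/.*is not authorized to perform: \([^ ]*\) on resource: \([^ ]*\).*/action=\1 resource=\2/p'
}

# Example message with made-up identifiers:
parse_denied "User: arn:aws:iam::123456789012:user/alice is not authorized to perform: s3:GetObject on resource: arn:aws:s3:::example-bucket/report.csv because no identity-based policy allows the s3:GetObject action"
```

The function prints nothing when the message does not match, which is a cue to read the raw error instead of guessing.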

| Symptom | Likely cause | Diagnostic step | Fix |
|---------|--------------|-----------------|-----|
| AccessDenied / not authorized to perform | No Allow for that action/resource in the identity policy | Read the error’s action+ARN; run the Policy Simulator | Add a least-privilege Allow for the exact action on the exact resource ARN |
| Has an Allow but still denied | An explicit Deny somewhere (identity, resource, SCP, boundary) overrides it | CloudTrail errorMessage names the deny source; Policy Simulator | Remove/scope the deny; if it’s an SCP, fix it at the Organizations level |
| Allowed in one account, denied in another | SCP on the target account’s OU, or missing cross-account trust | Check the account’s SCPs; check the role’s trust policy | Adjust the SCP; add the principal to the role’s trust policy (sts:AssumeRole) |
| AccessDenied calling a service from a role | Permission boundary caps the role below the action it needs | Compare the boundary with the action; Policy Simulator | Widen the boundary (carefully) or grant the action within the boundary |
| Can assume a role but can’t act | Session policy or the role’s own policy is too narrow | Inspect the assumed-role session; check the role policy | Broaden the role/session policy to the needed action; re-assume |
| Resource owner denies despite identity Allow | Resource-based policy (bucket/KMS/queue) doesn’t allow the principal | Read the resource policy; check Principal/Condition | Add the principal to the resource policy; both sides must allow cross-account |
| Worked yesterday, denied today | A recent policy change (SCP, boundary, or policy edit) | CloudTrail for Put*Policy/Attach*/Detach* events | Revert/fix the change; codify policies in IaC so edits are reviewed |
| MFA-conditioned action denied | A Condition requires MFA (aws:MultiFactorAuthPresent) and the session lacks it | Read the policy Condition; check the session | Re-authenticate with MFA; or scope the condition correctly |

A common trap: a role “has the policy” but a call still fails. CloudTrail’s errorMessage reads “explicit deny in a service control policy” — the identity policy was never the problem; an SCP on the OU blocks it for everyone in that account. Read the evidence; don’t attach AdministratorAccess to paper over a guardrail that is doing its job.

S3 playbook: “403 / Access Denied”

S3 403s are notorious because a single request is evaluated against several independent access controls, and any one can deny it. In rough order: the caller’s IAM identity policy, the bucket policy, the legacy ACL (now off by default under Object Ownership: bucket-owner-enforced), S3 Block Public Access (account- and bucket-level, which overrides any policy that would grant public access), the VPC endpoint policy if access is via a gateway endpoint, and KMS key permissions if the object is encrypted with SSE-KMS (you need kms:Decrypt on the key, not just s3:GetObject). Diagnose in that order, because an identity-policy gap and a KMS-permission gap both surface as 403.

The decisive tools are the error context (the request via CloudTrail’s S3 data events shows the principal and the denied operation), the bucket policy/Block Public Access settings in the console, and IAM Access Analyzer, which flags buckets exposed beyond the account and validates policies. Crucially, S3 returns 403 AccessDenied for a missing object too when the caller lacks s3:ListBucket, to avoid leaking existence — so a “403” can really be a “404 in disguise”.

| Symptom | Likely cause | Diagnostic step | Fix |
|---------|--------------|-----------------|-----|
| 403 AccessDenied reading an object | IAM identity policy or bucket policy doesn’t allow s3:GetObject | Check the identity policy and bucket policy for the object ARN (arn:…:bucket/*) | Grant s3:GetObject on the object ARN; remember the /* for objects vs the bucket ARN for the bucket |
| 403 on an object encrypted with SSE-KMS | Missing kms:Decrypt on the CMK | Check the KMS key policy/grants for the caller | Add kms:Decrypt (and kms:GenerateDataKey for writes) on the key to the principal |
| Public/anonymous read returns 403 | S3 Block Public Access is on (the secure default) | Check account- and bucket-level Block Public Access | Prefer a presigned URL or CloudFront + OAC; only relax BPA if truly required |
| AccessDenied but the object “doesn’t exist” | Caller lacks s3:ListBucket, so 404 is masked as 403 | Check for s3:ListBucket on the bucket ARN | Grant s3:ListBucket on the bucket; verify the key actually exists |
| Cross-account access denied | Bucket policy doesn’t allow the other account’s principal, or Object Ownership issue | Read the bucket policy Principal; check Object Ownership | Allow the external principal in the bucket policy; set bucket-owner-enforced or grant ownership |
| 403 only from inside a VPC | VPC gateway endpoint policy restricts the bucket/action | Check the endpoint policy attached to the route’s endpoint | Allow the bucket/action in the endpoint policy (it defaults to full, but may be locked down) |
| AccessDenied writing with ACL | Request sends a canned ACL but bucket is bucket-owner-enforced | Check Object Ownership; inspect the x-amz-acl header | Drop the ACL header; rely on bucket policy (ACLs are disabled by default now) |
| 403 SignatureDoesNotMatch | Wrong region endpoint, clock skew, or wrong credentials | Verify the bucket Region, client clock, and the access key | Use the correct regional endpoint; fix NTP; use current credentials (prefer roles over keys) |

One of the most common S3 403s: an identity has s3:GetObject but the object is SSE-KMS-encrypted and the principal lacks kms:Decrypt. S3 and KMS are separate authorization systems, so granting the S3 action without the KMS permission denies the read every time. Heavily tested, heavily tripped over.
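The bucket-ARN versus object-ARN distinction from the table is easiest to see in a policy document. The sketch below prints a minimal read-only identity policy for one bucket; the function name `s3_read_policy` and the bucket name are hypothetical, and the standard `aws` partition is assumed. For SSE-KMS objects the principal would additionally need kms:Decrypt on the key, which is left out here because the key ARN is specific to your account.

```shell
# Print a minimal read-only identity policy for one bucket.
# s3:ListBucket applies to the bucket ARN; s3:GetObject applies to the object ARN (bucket/*).
s3_read_policy() {   # hypothetical helper: s3_read_policy <bucket-name>
  local bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

s3_read_policy example-bucket
```

Including s3:ListBucket also means a missing key returns 404 instead of a misleading 403.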

Lambda playbook: “errors, timeouts and throttling”

Lambda hides the server, so debugging shifts to CloudWatch Logs (each function writes to a log group /aws/lambda/<function>, and every invocation ends with a REPORT line giving duration, billed duration, memory used and, on a cold start, init duration), CloudWatch metrics (Errors, Throttles, Duration, ConcurrentExecutions, IteratorAge for stream sources), and AWS X-Ray for traces. Separate three different failure modes: the function errors (your code throws, or its execution role lacks a permission), the function times out (it ran past the configured timeout, often because of a downstream call with no timeout of its own), and the function is throttled (it hit a concurrency limit and returned 429/TooManyRequestsException). Cold starts are a latency symptom, not an error.

The fastest first move is to open the function’s log group and read the actual exception or the Task timed out after N seconds line, rather than guessing from a metric. The execution role is the usual culprit for AccessDenied inside a function — it needs both the permission for what the code calls and the basic logging permissions to even write to CloudWatch.
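Reading the REPORT line can be partly mechanised. The sketch below extracts the Duration field and compares it with the configured timeout. The sample line is invented but follows the tab-separated shape Lambda writes; the function name `report_check` and the 90% threshold are arbitrary choices for illustration.

```shell
# Compare the Duration in a Lambda REPORT line with the function's timeout.
report_check() {   # hypothetical helper: report_check "<REPORT line>" <timeout-seconds>
  local line=$1 timeout_s=$2 duration_ms
  # Anchor on RequestId so "Billed Duration" and "Init Duration" are not matched by mistake.
  duration_ms=$(printf '%s\n' "$line" |
    sed -n 's/.*RequestId: [^[:space:]]*[[:space:]]*Duration: \([0-9.]*\) ms.*/\1/p')
  [ -n "$duration_ms" ] || { echo "not a REPORT line"; return 1; }
  awk -v d="$duration_ms" -v t="$timeout_s" 'BEGIN {
    if (d + 0 >= 0.9 * t * 1000) printf "duration %s ms: within 10%% of the %s s timeout\n", d, t
    else printf "duration %s ms: comfortable headroom under the %s s timeout\n", d, t
  }'
}

# Invented sample line in the usual tab-separated layout:
sample=$(printf 'REPORT RequestId: 1a2b3c4d\tDuration: 2995.12 ms\tBilled Duration: 2996 ms\tMemory Size: 128 MB\tMax Memory Used: 64 MB')
report_check "$sample" 3
```

A duration that sits at the limit on every failing invocation suggests a hanging downstream call rather than slow code, which is why the client-side timeout is the first fix to try.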

| Symptom | Likely cause | Diagnostic step | Fix |
|---------|--------------|-----------------|-----|
| Function errors with AccessDenied | Execution role lacks the permission for an AWS call the code makes | CloudWatch Logs stack trace; check the execution role | Add the action to the execution role (e.g. dynamodb:PutItem, s3:GetObject, kms:Decrypt) |
| Task timed out after N seconds | Code exceeds the timeout, usually a downstream call that hangs | Logs REPORT line; check the downstream call’s own timeout | Set a client-side timeout on the downstream call; raise the function timeout only if genuinely needed |
| Throttling/429/TooManyRequestsException | Hit the account/function concurrency limit | CloudWatch Throttles and ConcurrentExecutions | Raise the account quota, set reserved concurrency, or smooth the source (SQS, batching) |
| First/occasional calls slow (cold start) | New execution environment init (large package, VPC ENI, heavy init) | Logs Init Duration; check package size and VPC config | Provisioned concurrency or SnapStart; slim the package; move init out of the handler |
| Function in a VPC can’t reach the internet/AWS APIs | No NAT gateway for the private subnets, or missing VPC endpoints | Check the function’s subnets’ routes; check endpoints | Add a NAT gateway, or VPC endpoints for the services (S3/DynamoDB gateway, others interface) |
| Errors spike but code looks fine | Unhandled exception, bad input, or a deployment regression | Logs around the spike; CloudTrail for UpdateFunctionCode | Fix the code/handler; roll back via versions/aliases; add input validation |
| Stream source lagging (IteratorAge climbing) | Function too slow / failing on Kinesis/DynamoDB Streams, blocking the shard | CloudWatch IteratorAge, Errors | Speed up/parallelise; enable bisect-batch-on-error and an on-failure destination; increase shards |
| 429 from API Gateway in front of Lambda | API Gateway throttle/quota or downstream Lambda throttling | API Gateway metrics; Lambda Throttles | Adjust the usage plan/throttle; raise Lambda concurrency; cache |

Tie this to deployments: many Lambda error spikes appear immediately after a deploy. The robust pattern is versions and aliases with weighted (canary) routing — shift a small percentage of traffic to the new version, watch the Errors and Duration alarms, then complete the shift or roll back instantly by pointing the alias back. Covered in Lambda Performance: Cold Starts, Provisioned Concurrency & SnapStart.
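The weighted-alias pattern can be sketched with two thin wrappers around `aws lambda update-alias`. The function names, the alias name and the version numbers are hypothetical. The wrappers only change the extra weight on the alias; they assume the alias already points at the stable version, and that an empty AdditionalVersionWeights map sends all traffic back to it.

```shell
# Shift a fraction of an alias's traffic to a new version, or undo the shift.
shift_canary() {   # hypothetical helper: shift_canary <function> <alias> <new-version> <weight 0-1>
  aws lambda update-alias --function-name "$1" --name "$2" \
    --routing-config "AdditionalVersionWeights={\"$3\"=$4}"
}

rollback_canary() {   # hypothetical helper: rollback_canary <function> <alias>
  # Clearing the extra weights returns all traffic to the alias's primary version.
  aws lambda update-alias --function-name "$1" --name "$2" \
    --routing-config 'AdditionalVersionWeights={}'
}

# Example usage (not run here): send 10% of "live" traffic to version 7, then undo it.
#   shift_canary orders live 7 0.1
#   rollback_canary orders live
```

Watching the Errors and Duration alarms between the two calls is the actual safety mechanism; the commands only make the shift and the rollback cheap.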

Hands-on lab: diagnose a deliberately broken EC2 instance

In this lab you will create a fault on purpose, then use the method to find and fix it. We’ll launch a tiny Free Tier instance, lock its security group so SSH is blocked, diagnose the block with the VPC Reachability Analyzer and read-only checks (never touching the instance), then fix it. Everything uses t2.micro/t3.micro (Free Tier eligible) and is deleted at the end. Run it in AWS CloudShell (Bash), which has the CLI and your credentials pre-configured.

1. Set variables and find a default VPC + subnet.

REGION=us-east-1
AMI=$(aws ssm get-parameters \
  --names /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --query 'Parameters[0].Value' --output text --region $REGION)
VPC=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true \
  --query 'Vpcs[0].VpcId' --output text --region $REGION)
SUBNET=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC \
  --query 'Subnets[0].SubnetId' --output text --region $REGION)
echo "AMI=$AMI VPC=$VPC SUBNET=$SUBNET"

2. Create a key pair and a security group that allows SSH from your IP.

aws ec2 create-key-pair --key-name ts-lab-key \
  --query 'KeyMaterial' --output text --region $REGION > ts-lab-key.pem
chmod 600 ts-lab-key.pem

MYIP=$(curl -s https://checkip.amazonaws.com)
SG=$(aws ec2 create-security-group --group-name ts-lab-sg \
  --description "TS lab" --vpc-id $VPC \
  --query 'GroupId' --output text --region $REGION)
aws ec2 authorize-security-group-ingress --group-id $SG \
  --protocol tcp --port 22 --cidr ${MYIP}/32 --region $REGION
echo "SG=$SG (SSH allowed from ${MYIP}/32)"

3. Launch a Free Tier instance with a public IP.

IID=$(aws ec2 run-instances --image-id $AMI --instance-type t2.micro \
  --key-name ts-lab-key --security-group-ids $SG --subnet-id $SUBNET \
  --associate-public-ip-address \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ts-lab}]' \
  --query 'Instances[0].InstanceId' --output text --region $REGION)
aws ec2 wait instance-running --instance-ids $IID --region $REGION
PUBIP=$(aws ec2 describe-instances --instance-ids $IID \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text --region $REGION)
echo "Instance $IID at $PUBIP"

Confirm SSH works (answer yes to the host-key prompt, then exit):

ssh -i ts-lab-key.pem ec2-user@$PUBIP 'echo connected; exit'

4. Break it. Revoke the SSH ingress rule — simulating “someone tightened the firewall and now I can’t get in”:

aws ec2 revoke-security-group-ingress --group-id $SG \
  --protocol tcp --port 22 --cidr ${MYIP}/32 --region $REGION

Now retry the SSH from step 3 — it hangs and times out. Resist the urge to terminate and relaunch. Apply the method.

5. Isolate the layer with the Reachability Analyzer (read-only). Ask “can the internet reach this instance on port 22?”:

IGW=$(aws ec2 describe-internet-gateways \
  --filters Name=attachment.vpc-id,Values=$VPC \
  --query 'InternetGateways[0].InternetGatewayId' --output text --region $REGION)
PATHID=$(aws ec2 create-network-insights-path \
  --source $IGW --destination $IID --protocol tcp --destination-port 22 \
  --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text --region $REGION)
ANALYSIS=$(aws ec2 start-network-insights-analysis \
  --network-insights-path-id $PATHID \
  --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text --region $REGION)
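# Poll until the analysis leaves the "running" state (it ends as succeeded or failed).
while [ "$(aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids $ANALYSIS \
  --query 'NetworkInsightsAnalyses[0].Status' --output text --region $REGION)" = "running" ]; do
  sleep 5
done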
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids $ANALYSIS \
  --query 'NetworkInsightsAnalyses[0].{Reachable:NetworkPathFound,Blocker:Explanations[0].ExplanationCode}' \
  --output table --region $REGION

Expected output: Reachable = False, with a blocker pointing at the security group (e.g. an explanation code such as ENI_SG_RULES_MISMATCH). In read-only calls you’ve proven the fault is the security group (not the instance, not SSH, not your client) — the whole point of the method.

6. Corroborate by reading the effective security-group rules.

aws ec2 describe-security-groups --group-ids $SG \
  --query 'SecurityGroups[0].IpPermissions' --output table --region $REGION

The inbound rules are empty — there is no longer an allow for port 22. That’s the merged, real ruleset AWS is applying.

7. Fix the root cause (re-add the scoped rule) and verify by re-running the original reproduction:

aws ec2 authorize-security-group-ingress --group-id $SG \
  --protocol tcp --port 22 --cidr ${MYIP}/32 --region $REGION
# Verify from the user's perspective — the reproduction from step 3:
ssh -i ts-lab-key.pem ec2-user@$PUBIP 'echo reconnected; exit'

It connects again. You diagnosed and fixed a connectivity incident without ever logging into the instance — because the evidence pointed at the network layer.

8. Prevent (discuss). In production you’d codify the working security group in CloudFormation/Terraform so an ad-hoc revoke can’t drift in unreviewed, and add an EventBridge rule (or AWS Config rule) that alerts on AuthorizeSecurityGroupIngress/RevokeSecurityGroupIngress — guards covered in the complex-incidents and governance lessons.

Cleanup — delete everything so you pay nothing further:

aws ec2 terminate-instances --instance-ids $IID --region $REGION
aws ec2 wait instance-terminated --instance-ids $IID --region $REGION
aws ec2 delete-network-insights-analysis --network-insights-analysis-id $ANALYSIS --region $REGION
aws ec2 delete-network-insights-path --network-insights-path-id $PATHID --region $REGION
aws ec2 delete-security-group --group-id $SG --region $REGION
aws ec2 delete-key-pair --key-name ts-lab-key --region $REGION
rm -f ts-lab-key.pem

Cost note. A t2.micro/t3.micro instance is Free Tier eligible (750 hours/month for the first 12 months); run for a few minutes and the compute is free or a rounding error. The Reachability Analyzer charges a small per-analysis fee (cents) — well under ₹10 / ~US$0.10 for this lab — so delete the path/analysis as shown. Confirm the instance is terminated afterwards so nothing lingers on the bill.

Common mistakes & troubleshooting

The meta-mistakes — the errors people make while troubleshooting — cost more than any single misconfiguration:

| Mistake | Why it bites | Do this instead |
|---------|--------------|-----------------|
| Changing several settings at once | You can’t tell what fixed it (or what broke worse) | Change one variable, test, then the next |
| Fixing the symptom, not the cause | The incident recurs tomorrow | Trace to root cause; capture a prevention (step 8) |
| Trusting the intended config over the effective one | SGs/NACLs/routes/policies combine; what you set ≠ what’s applied | Read effective SG rules, route tables, and run the Policy Simulator |
| Opening a security group to 0.0.0.0/0 to “test” | Leaves a permanent hole; often isn’t even the cause | Scope to your IP/32; use Session Manager/Instance Connect Endpoint |
| Attaching AdministratorAccess to end an AccessDenied | Creates standing over-privilege; hides the real gap | Read the error’s action+ARN; grant the least-privilege action |
| Confusing the S3 action with the KMS permission | s3:GetObject without kms:Decrypt still returns 403 | Grant the KMS permission for SSE-KMS objects |
| Forgetting NACLs are stateless | Return traffic blocked even when the SG allows the request | Allow both the request and the ephemeral-port response |
| Skipping “what changed?” | You debug from scratch when an API call caused it | Check CloudTrail Event history first |
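For the last row, "what changed?" can be answered from the CLI as well as the console. The sketch below wraps `aws cloudtrail lookup-events` to list recent management events that touched a named resource. The function name `recent_changes` is hypothetical, and it assumes the resource (a security group ID, say) is recorded as a resource name on the events you care about. Event history can lag the API call by several minutes.

```shell
# List recent CloudTrail management events that reference a resource.
recent_changes() {   # hypothetical helper: recent_changes <resource-name-or-id> [max-results]
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=ResourceName,AttributeValue=$1" \
    --max-results "${2:-20}" \
    --query 'Events[].[EventTime,EventName,Username]' \
    --output table
}

# Example usage (not run here): who touched the lab's security group?
#   recent_changes "$SG"
```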

Best practices

Security notes

Troubleshooting under pressure is exactly when security hygiene erodes — guard against it:

Interview & exam questions

1. Walk me through how you troubleshoot an AWS issue. The loop: reproduce → isolate the layer → compare config vs desired → inspect CloudWatch/CloudTrail → hypothesise and test (one variable) → fix the root cause → verify by re-running the reproduction → prevent. Emphasise isolating the layer and read-only first.

2. An EC2 instance is unreachable on port 22. First moves? Check the security group inbound rules and the subnet NACL (remembering NACLs are stateless — the ephemeral response must be allowed too); confirm a public IP and an IGW route. The Reachability Analyzer computes the whole path and names the blocking component. Don’t reboot or open the SG to the world until the evidence points there.

3. Security group vs network ACL — the key differences. Security groups are stateful (return traffic auto-allowed), attach to ENIs, and have allow rules only. Network ACLs are stateless (you must allow both directions, including ephemeral ports), attach to subnets, are evaluated by rule number, and can have explicit deny rules. A stateless NACL blocking the response is a classic “the SG allows it but it still times out” cause.

4. Explain the IAM policy-evaluation logic. Every request starts as an implicit deny. An explicit Deny in any applicable policy overrides every allow. Otherwise the request needs an explicit Allow from a policy that grants (identity-based or, within the same account, resource-based), and it must also fall inside every limit that applies: permission boundaries, SCPs and session policies cap permissions but never grant them. Cross-account access needs an allow on both sides. Explicit deny > allow > implicit deny.
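As a rough illustration of that ordering, here is a deliberately simplified sketch. It assumes a single account, exact action-string matching, and no wildcards, conditions or resource ARNs. The split into `grants` (identity-based and resource-based policies) and `limits` (SCPs, permission boundaries, session policies) is a modelling shortcut for this example, not an AWS API.

```python
def matches(policy, effect, action):
    """A policy here is just a list of (effect, action) pairs."""
    return any(e == effect and a == action for e, a in policy)


def evaluate(action, grants, limits):
    """Return the decision for one action under a simplified evaluation order."""
    # 1. An explicit Deny anywhere wins outright.
    if any(matches(p, "Deny", action) for p in list(grants) + list(limits)):
        return "explicit deny"
    # 2. Every limit that applies must allow the action.
    if not all(matches(p, "Allow", action) for p in limits):
        return "implicit deny (blocked by a limit)"
    # 3. At least one granting policy must allow it.
    if any(matches(p, "Allow", action) for p in grants):
        return "allow"
    # 4. Nothing allowed it: the default applies.
    return "implicit deny"


identity = [("Allow", "s3:GetObject")]
scp_without_s3 = [("Allow", "ec2:DescribeInstances")]
print(evaluate("s3:GetObject", [identity], []))
print(evaluate("s3:GetObject", [identity], [scp_without_s3]))
```

The second call shows the case that surprises people most: the role's own policy allows the action, yet the request fails because a limit does not include it.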

5. A role “has the right policy” but still gets AccessDenied. Name three causes. (a) An explicit deny elsewhere: an SCP on the account’s OU, a permission boundary, or a resource policy; (b) a resource-based policy that doesn’t allow the principal (e.g. cross-account bucket/KMS); (c) the action/resource ARN in the policy doesn’t match what’s being called. CloudTrail’s errorMessage often names the deny source; the Policy Simulator confirms.

6. List the access controls an S3 request is evaluated against. IAM identity policy, bucket policy, ACLs (disabled by default under bucket-owner-enforced), S3 Block Public Access (overrides public grants), VPC endpoint policy (if the request goes through a VPC endpoint), and KMS key permissions for SSE-KMS objects. Any one of them can return 403.

7. An identity with s3:GetObject gets 403 on an encrypted object. Why? The object is SSE-KMS encrypted and the principal lacks kms:Decrypt on the key. S3 and KMS are separate authorization systems; you need both the S3 action and the KMS permission (plus kms:GenerateDataKey to write).

8. Distinguish a Lambda timeout, an error and a throttle, and the metric for each. A timeout = ran past the configured timeout (Task timed out; often a hanging downstream call) — watch Duration near the limit. An error = the code threw or the execution role lacks a permission — watch Errors and read the stack trace in CloudWatch Logs. A throttle = hit a concurrency limit (429/TooManyRequestsException) — watch Throttles and ConcurrentExecutions.
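A small triage helper shows how the three signals separate. It is a sketch under assumptions: the metric values are taken as already fetched (sums for Throttles and Errors, the maximum for Duration in milliseconds), the function name is hypothetical, and "near the limit" is arbitrarily set at 90% of the configured timeout. Timeouts are also counted in the Errors metric, so the duration check is what tells the two apart.

```python
def classify_lambda_failure(metrics, timeout_ms, near_limit=0.9):
    """Map pre-fetched CloudWatch metric values to likely failure modes."""
    findings = []
    if metrics.get("Throttles", 0) > 0:
        findings.append("throttle")  # concurrency limit was hit
    if metrics.get("Errors", 0) > 0:
        # A timed-out invocation is recorded as an error, so use Duration
        # to decide which of the two this most likely is.
        if metrics.get("Duration", 0) >= near_limit * timeout_ms:
            findings.append("timeout")
        else:
            findings.append("error")
    return findings or ["healthy"]


print(classify_lambda_failure({"Throttles": 12}, timeout_ms=3000))
print(classify_lambda_failure({"Errors": 5, "Duration": 2990}, timeout_ms=3000))
print(classify_lambda_failure({"Errors": 5, "Duration": 120}, timeout_ms=3000))
```

The classification only points at where to look next; the stack trace in CloudWatch Logs remains the confirmation for an error.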

9. How do you reach a private EC2 instance with no public IP and no bastion? AWS Systems Manager Session Manager (needs the SSM agent and an instance role with the SSM managed policy, plus network to the SSM endpoints) or an EC2 Instance Connect Endpoint — both give shell access with no inbound port and no public IP, which is also the more secure pattern in general.

10. A private subnet’s instances can’t reach the internet. What do you check, and what’s the fix? The subnet’s route table for 0.0.0.0/0 → a NAT gateway, and that the NAT itself sits in a public subnet with an IGW route and an Elastic IP. Reachability Analyzer will name a missing route. Fix the route/NAT, not the security group, unless the evidence says otherwise.
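The same check can be expressed as code over route tables written out as plain dictionaries. This is a sketch: the resource IDs are made up, the route tables are hand-typed rather than fetched from AWS, and the helper names are hypothetical. It uses longest-prefix matching, which is how a route table picks the route for a destination.

```python
import ipaddress


def next_hop(route_table, destination_ip):
    """Longest-prefix match over {cidr: target}; None when nothing matches."""
    dest = ipaddress.ip_address(destination_ip)
    candidates = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0].prefixlen)[1]


def diagnose_private_egress(private_rt, nat_subnet_rt, nat_has_eip):
    """Walk the path: private subnet -> NAT gateway -> internet gateway."""
    hop = next_hop(private_rt, "8.8.8.8")
    if hop is None or not hop.startswith("nat-"):
        return "private route table has no 0.0.0.0/0 route to a NAT gateway"
    nat_hop = next_hop(nat_subnet_rt, "8.8.8.8")
    if nat_hop is None or not nat_hop.startswith("igw-"):
        return "NAT gateway's subnet has no 0.0.0.0/0 route to an internet gateway"
    if not nat_has_eip:
        return "NAT gateway has no Elastic IP"
    return "routing looks correct; look at another layer"


# Hypothetical IDs for illustration only.
private_rt = {"10.0.0.0/16": "local"}
public_rt = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-0example"}
print(diagnose_private_egress(private_rt, public_rt, nat_has_eip=True))
```

Here the private route table only has the local route, so the first check fails and the security group never needs to be touched.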

11. How do you answer “who changed this, and when?” AWS CloudTrail. Its Event history (90 days of management events, no setup) and any configured trail record API activity: the principal, the action, the source IP, the parameters and the result. Data events such as S3 object reads appear only if a trail is configured to log them. Filter for the relevant Modify*/Put*/Delete* events around the incident start.
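Once events are exported, that filter is only a few lines. The sketch below works on records already loaded as dictionaries. The field names (`eventTime`, `eventName`, `userIdentity.arn`) follow the CloudTrail record format, but the sample events, the ARNs, the `changes_before` helper and the 60-minute window are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

MUTATING_PREFIXES = ("Modify", "Put", "Delete")


def changes_before(events, incident_start, window_minutes=60):
    """Return mutating API calls in the window leading up to the incident."""
    window_start = incident_start - timedelta(minutes=window_minutes)
    hits = []
    for event in events:
        when = datetime.strptime(
            event["eventTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        in_window = window_start <= when <= incident_start
        if in_window and event["eventName"].startswith(MUTATING_PREFIXES):
            hits.append(
                (event["eventTime"], event["eventName"], event["userIdentity"]["arn"])
            )
    return sorted(hits)


# Made-up sample records, not real CloudTrail output.
sample_events = [
    {"eventTime": "2024-05-01T08:55:00Z", "eventName": "PutRolePolicy",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
    {"eventTime": "2024-05-01T09:00:00Z", "eventName": "DescribeInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
    {"eventTime": "2024-05-01T06:00:00Z", "eventName": "DeleteRolePolicy",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/other"}},
]
incident = datetime(2024, 5, 1, 9, 10, tzinfo=timezone.utc)
for row in changes_before(sample_events, incident):
    print(row)
```

Read-only calls and changes outside the window drop out, leaving a short list of candidate causes to compare against the incident start time.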

12. What turns a junior troubleshooter into a senior one? The last step: prevention. Juniors fix the symptom; seniors trace the root cause and leave behind a CloudWatch alarm, an AWS Config rule, an SCP guardrail, infrastructure-as-code, or a runbook so it can’t silently recur — and they reason by isolating the layer instead of guessing.

Quick check

  1. In the eight-step method, which step is the “master move” that saves the most time, and what does it determine?
  2. An EC2 instance’s security group allows inbound SSH, but the connection still times out. Name the stateless component that could be the cause and why.
  3. You get AccessDenied calling an API from a role that has an Allow for that action. Where do you look first to find what’s overriding the allow, and what could it be?
  4. An object read returns 403 even though the caller has s3:GetObject. Give the most common reason on an encrypted bucket.
  5. A Lambda function returns 429 TooManyRequestsException. Which failure mode is this, and which CloudWatch metric confirms it?

Answers

  1. Isolate the layer (step 2). It determines which layer is failing — identity, network, DNS, the resource, or the app — so you fix the right thing instead of the first thing.
  2. The subnet’s network ACL (NACL). NACLs are stateless, so even when the security group allows the inbound request, a NACL that doesn’t allow the outbound ephemeral-port response (1024–65535) drops the return traffic and the connection times out.
  3. CloudTrail — the denied call’s errorMessage often names the deny source (e.g. “explicit deny in a service control policy”); the IAM Policy Simulator corroborates. The override is an explicit Deny: an SCP on the OU, a permission boundary, or a resource-based policy.
  4. The object is SSE-KMS encrypted and the principal lacks kms:Decrypt on the KMS key. S3 and KMS are separate authorization systems — the S3 action alone isn’t enough.
  5. A throttle — the function hit its (account or reserved) concurrency limit. The Throttles metric (alongside ConcurrentExecutions) confirms it.

Exercise

You’re handed this incident cold: “Our order-processing Lambda, triggered by an SQS queue, started failing at 09:10. The queue is backing up and the dead-letter queue is filling. The app team swears nothing changed in the code.”

Work it with the method and write down, for each step, what you would do and why:

  1. Reproduce — how do you make the failure observable on demand (hint: a test invocation with a captured SQS message payload)?
  2. Isolate the layer — walk the five layers; for this symptom (a queue-triggered function failing, DLQ filling), which layers are most likely and which can you quickly rule out?
  3. Config vs desired / what changed — where do you look to test “nothing changed” (think beyond the function’s own code)?
  4. Inspect — name the two or three read-only diagnostics you’d run (hint: one is the function’s CloudWatch log group, one is a metric, one is CloudTrail) and what each result would tell you.
  5. Hypothesise & test, fix, verify, prevent — state your single most likely hypothesis, the one test that confirms it, the fix, how you’d verify from the queue’s perspective, and the prevention you’d leave behind.

A strong answer recognises that a queue-triggered function failing points hardest at either an execution-role permission that was changed (read the log stack trace; check CloudTrail for Put*Policy/SCP changes), a downstream dependency timing out (the Duration/timeout signal), or a throttle (the Throttles metric); that CloudTrail is where you verify “nothing changed” even when the code didn’t; that the fix is the least-privilege grant or a client-side timeout, not raising every limit blindly; and that the prevention is an alarm on Errors/IteratorAge plus codifying the role in IaC, with the DLQ giving you a safe replay once fixed.

Certification mapping

This lesson maps to SOA-C02: AWS Certified SysOps Administrator – Associate, chiefly the Troubleshooting and Optimization and Monitoring, Logging, and Remediation domains:

The method here is also what SAA-C03 and SAP-C02 — and real architecture interviews — probe when they ask how you’d approach an unfamiliar failure. The companion complex-incidents lesson takes it to multi-service root-cause analysis.

Glossary

Next steps

You now have a method that works on any AWS failure and playbooks for the services that break most often. The natural next move is to take that same method up a level — to incidents that span several services at once and demand correlation across signals:

Related reading:
