Azure Troubleshooting Playbooks: Network, VM, Identity, Storage & Apps

There is a particular kind of panic that hits when something in Azure breaks in production. A web app starts returning 500s; a virtual machine you could reach yesterday refuses RDP; a perfectly valid storage request comes back 403; a colleague who definitely has access is told they don’t. The temptation is to start clicking — toggle a setting, restart the VM, regenerate a key, add an Owner role assignment — and hope. That is gambling, not troubleshooting, and it is how a five-minute incident becomes a two-hour one with three new misconfigurations layered on top of the original fault.

This lesson teaches the opposite habit: a repeatable method that turns “it’s broken and I don’t know why” into a short, ordered set of questions that converge on the real cause — plus per-domain playbooks (networking, VM, identity, storage, App Service) mapping a symptom to its likely cause, the one diagnostic that confirms it, and the fix. A senior engineer is not someone who memorises every error code; it is someone who can take an unfamiliar failure and narrow it down calmly. By the end you will have that instinct, plus a reference to keep open during an incident.

This is a methodology lesson — it teaches how to think. The next lesson, The Azure Diagnostics Toolkit, teaches the tools (Network Watcher, Resource Health, boot diagnostics, serial console, KQL) hands-on. Here we cover enough of each tool to act; there we go deep.

Learning objectives

By the end of this lesson you can:

Apply an eight-step troubleshooting method — reproduce → isolate the layer → compare config vs desired → inspect logs and metrics → form and test a hypothesis → fix → verify → prevent — to any Azure incident.
Isolate the failing layer quickly (identity? network? DNS? the resource itself? the application?) instead of fixing the wrong thing.
Diagnose the most common networking failures using effective security rules, effective routes, next hop and DNS resolution.
Diagnose a VM that won’t boot or won’t accept RDP/SSH using boot diagnostics, the serial console and run-command.
Diagnose identity and RBAC failures using sign-in logs, Check access / effective permissions and Conditional Access report-only mode.
Diagnose storage 403/auth and App Service 5xx failures, and read a decision tree that routes a symptom to the right playbook.
Capture each fix as a prevention (an alert, a policy, a runbook) so the same incident does not recur.

Prerequisites & where this fits

You should already understand Azure’s core building blocks: subscriptions and resource groups, virtual networks and NSGs, virtual machines, Microsoft Entra ID and RBAC, and storage accounts, all covered in earlier lessons. You needn’t be an expert in any; troubleshooting is precisely the skill of reasoning about a system you only partly understand. This is Lesson G7, the first of two in the Operations & Troubleshooting module. It pairs with Azure Service Health, Advisor & Resource Graph (is it me or is it Azure?) and leads into the diagnostics-tools deep dive. Everything here maps to AZ-104, where monitoring and troubleshooting fall under the “Monitor and maintain Azure resources” exam domain.

The troubleshooting mindset: eight steps that always work

Tools change; the method does not. Whether you are debugging a 2003-era on-prem server or a 2026 Azure landing zone, the same loop applies. Internalise these eight steps and you will never again be the person frantically toggling settings.

#	Step	The question it answers	Why it matters
1	Reproduce	Can I make it fail on demand?	A fault you can’t reproduce, you can’t confirm you’ve fixed. Pin down exactly who, what, where, when.
2	Isolate the layer	Which layer is actually failing: identity, network, DNS, the resource, or the app?	This is the master move. Most wasted time comes from fixing the wrong layer.
3	Config vs desired	Does the current configuration match what I intended?	Most Azure incidents are a config drift or a recent change, not a platform fault.
4	Inspect logs & metrics	What does the evidence say?	Logs and metrics are ground truth. Read them before theorising, not after.
5	Hypothesise & test	What’s my single best guess, and what one test confirms or kills it?	One variable at a time. A test that can only “succeed” proves nothing.
6	Fix	What is the smallest change that addresses the root cause?	Fix the cause, not the symptom; change one thing so you know what worked.
7	Verify	Is it actually fixed — from the user’s perspective?	Re-run the reproduction from step 1. “It looks fine in the portal” is not verification.
8	Prevent	How do I make sure this never silently recurs?	Turn the fix into an alert, a policy, a runbook, or a test. This is what makes you senior.

A few principles make the loop sharper:

Change one thing at a time. Flip three settings and it starts working, and you’ve learned nothing — and maybe introduced two new problems. Revert speculative changes that didn’t help.
Read before you write. Every diagnostic in this lesson is read-only — it inspects state without changing it. Exhaust the read-only checks before you touch anything.
Believe the evidence, not the assumption. “But it should work” is the most expensive phrase in operations. The effective rules, the actual logs, the real DNS answer — those are reality.
Ask “what changed?” first. The Azure Activity Log is a per-subscription audit of every control-plane write (who deployed, modified or deleted what, and when); a fault that started at 14:05 next to a deployment at 14:03 is rarely a coincidence.
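The “what changed?” check can be scripted so it is the first thing you run. This is a minimal sketch, assuming the Azure CLI is already signed in (as it is in Cloud Shell); the function name `what_changed` and the two-hour default window are illustrative choices, not part of any Azure tooling.

```shell
# what_changed RESOURCE_GROUP [OFFSET]
# Lists recent control-plane events (who did what, and when) for one resource group.
# OFFSET uses the CLI's own format, for example 2h or 1d. Default here: 2h (an assumption).
what_changed() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset "${2:-2h}" \
    --query "[].{time:eventTimestamp, caller:caller, operation:operationName.value, status:status.value}" \
    -o table
}

# Usage (read-only):
#   what_changed rg-ts-lab 1h
```

Line up the timestamps against the moment the fault started; a write a minute or two before the first failure is the first hypothesis to test.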

Isolating the layer — the master skill

Step 2 deserves its own model because it is where time is won or lost. Almost every Azure failure lives in one of five layers. Ask the questions top to bottom and you will usually localise the fault in under a minute:

Layer	“Is the problem here?” — quick test	Typical symptoms
Identity / authorization	Can the caller authenticate, and do they have the role for this action?	`AuthorizationFailed`, `403` on the control plane, “you don’t have access”, sign-in blocked
Network / connectivity	Can the packet physically reach the target on the right port?	Timeouts, “connection refused”, can’t RDP/SSH, intermittent drops
DNS / name resolution	Does the name resolve to the IP you expect?	“host not found”, connecting to the public IP of a private resource
The resource / platform	Is the resource itself healthy and running?	VM won’t boot, service degraded, `503` from a stopped backend
The application	Is the code/config inside the resource the problem?	App-level `500`, stack traces, bad connection string, failed dependency

The trick is to test the cheapest, most likely layer first and to bisect: if a request fails from one machine but succeeds from another, the fault lies in whatever differs between them (subnet, NSG, route, DNS server, identity). If the same request fails everywhere, the problem is central (the resource or its config), not the caller.
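The DNS and network questions can be asked from any Linux client in a few seconds. The sketch below is one way to do it, assuming `getent`, `timeout` and `bash` are available on the client; the function name `classify_tcp` and its output labels are made up for this lesson. It separates “the name doesn’t resolve” from “no answer” (suspect an NSG or route) and from “fast reject” (suspect the service or the OS firewall).

```shell
# classify_tcp HOST PORT [TIMEOUT_SECONDS]
# Prints one of: dns-failure | open | timeout | refused-or-unreachable
classify_tcp() {
  # DNS layer: getent ahosts uses the system resolver and also accepts IP literals.
  if ! getent ahosts "$1" >/dev/null 2>&1; then
    echo "dns-failure"
    return 0
  fi
  # Network layer: attempt a TCP connect through bash's /dev/tcp, bounded by timeout(1).
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
  case $? in
    0)   echo "open" ;;                    # something is listening
    124) echo "timeout" ;;                 # no answer within the limit
    *)   echo "refused-or-unreachable" ;;  # rejected quickly, or no route
  esac
}

# Usage: run the same probe from two clients and compare the answers.
#   classify_tcp api.internal.contoso.com 443
```

Running the identical probe from a machine that works and one that doesn’t is the bisect step in practice: differing answers point at the caller’s path, identical answers point at the target.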

Figure: Azure troubleshooting decision tree

The decision tree above is the same logic rendered as a flowchart: start from the symptom, ask “can it authenticate?”, then “can the packet arrive?”, then “does the name resolve?”, then “is the resource healthy?”, and finally “is it the app?” — branching to the matching playbook below at each point.

Networking playbook: “I can’t connect to it”

Connectivity is the single biggest source of Azure incidents, and almost all come down to four things: an NSG (network security group) rule blocking traffic, a UDR (user-defined route) sending the packet somewhere unexpected, DNS resolving the wrong address, or the target service not listening. Beginners stare at the VM; the fix is to inspect the effective configuration Azure actually applied, because NSGs combine at subnet and NIC level and routes combine system + BGP + UDR.

Three Network Watcher checks resolve most cases. IP flow verify answers “would an NSG allow this exact 5-tuple (source IP, dest IP, port, protocol, direction)?” and names the deciding rule. Next hop answers “where does a packet to this destination actually go — internet, VNet, a virtual appliance, or a black hole?”. Effective security rules and effective routes (on the NIC) show the merged, real ruleset rather than what you think you configured.

Symptom	Likely cause	Diagnostic step	Fix
Connection times out (no response)	NSG denying inbound on the port	Network Watcher → IP flow verify for the dest port; or read effective security rules on the NIC	Add/adjust an inbound `Allow` rule at the right priority (lower number wins); narrow source to your IP, not `Any`
Connection refused (fast reject)	The service isn’t listening / wrong port / firewall inside the OS	Confirm the app is bound (`ss -tlnp` / `netstat`); check the OS firewall	Start the service / bind the right port; open the host firewall (Windows Firewall, `ufw`)
Traffic silently disappears	A UDR forces `0.0.0.0/0` to a firewall/NVA that drops or isn’t up	Network Watcher → Next hop for the destination	Fix the route table, ensure the NVA/Azure Firewall is healthy, or correct the next-hop type
Reaches the wrong server, or “host not found”	DNS resolving to a stale/public IP	`nslookup`/`dig` the name from the client; check the VNet’s DNS servers and any Private DNS zone	Point the VNet at the right DNS; fix the A record / Private DNS zone link
Works from one subnet, fails from another	NSG/route differs, or peering lacks `AllowForwardedTraffic`/gateway transit	Compare effective rules/routes on both NICs; check peering settings	Align the NSG/UDR; enable the required peering options
Can’t reach a PaaS resource over private endpoint	Private DNS not resolving the privatelink name; still using public IP	`nslookup` the resource FQDN — it should return a private IP	Link the `privatelink.*` Private DNS zone to the VNet; add the A record
Intermittent outbound failures under load	SNAT port exhaustion on the outbound path	Check Load Balancer/NAT gateway SNAT metrics	Add a NAT gateway for deterministic egress; reuse connections in the app
Can’t RDP/SSH at all from the internet	Management port was exposed and later blocked by a deny rule, or no inbound allow exists for 3389/22	IP flow verify on 3389/22; check for a deny rule	Connect via Azure Bastion (no public IP needed); see the Bastion lesson

A grounding example: a VM is unreachable on port 443. IP flow verify returns Allow, rule AllowHTTPS — so the NSG is not the problem; you’ve eliminated a whole layer in one command. Next hop returns VirtualAppliance pointing at a firewall that’s showing unhealthy. There’s your cause, found in two read-only checks, no guessing.
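The DNS rows of the playbook (stale address, private endpoint resolving to a public IP) come down to one question: does the name resolve to a private address from inside the VNet? A rough sketch of that check follows. It assumes a glibc-based Linux client with `getent` and `awk`, and it treats “private” as the RFC 1918 ranges, which matches most VNets but is an assumption; the function names are illustrative.

```shell
# is_private_ipv4 ADDRESS : exit 0 if ADDRESS is in 10/8, 172.16/12 or 192.168/16.
is_private_ipv4() {
  case "$1" in
    10.*.*.*) return 0 ;;
    192.168.*.*) return 0 ;;
    172.1[6-9].*.*|172.2[0-9].*.*|172.3[01].*.*) return 0 ;;
    *) return 1 ;;
  esac
}

# resolve_ipv4 NAME : print the first IPv4 address the system resolver returns.
resolve_ipv4() {
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# check_private_dns NAME : report whether NAME resolves privately from this client.
check_private_dns() {
  pd_ip=$(resolve_ipv4 "$1")
  if [ -z "$pd_ip" ]; then
    echo "$1: does not resolve"
    return 2
  fi
  if is_private_ipv4 "$pd_ip"; then
    echo "$1 -> $pd_ip (private)"
  else
    echo "$1 -> $pd_ip (public: check the privatelink zone link and A record)"
    return 1
  fi
}
```

Run `check_private_dns` from a VM inside the VNet, not from your laptop: the answer depends on which DNS servers the client uses.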

Virtual machine playbook: “it won’t boot / I can’t log in”

When a VM misbehaves, separate two very different failures: the VM won’t start (a platform/OS-boot problem) versus the VM is running but you can’t connect (almost always network or in-guest config — see the networking playbook for the connectivity half). Two tools are your superpowers: boot diagnostics (a platform-captured screenshot and serial log of the boot, so you can see a kernel panic, a Windows recovery screen or a stuck fsck without logging in) and the serial console (a keyboard into the VM’s serial port via the platform, working with no network, NSG path or RDP/SSH). For changes from outside the OS, run-command runs a script as administrator/root over the control plane — no SSH required.

Symptom	Likely cause	Diagnostic step	Fix
VM stuck / won’t boot	OS corruption, bad fstab, failed update, full OS disk	Boot diagnostics screenshot + serial log	Serial console to edit fstab/grub; or attach OS disk to a rescue VM and repair
Provisioning/agent timeout, VM “running” but unmanageable	Azure VM Agent stopped or unhealthy	Check VM Agent status; review extension provisioning state	Restart the agent (serial console / run-command); reinstall the agent
Can’t RDP (Windows)	NSG, or RDP service/firewall in-guest, or expired credentials	IP flow verify 3389; boot diagnostics for a logon screen; run-command	Fix NSG; via run-command re-enable RDP / reset firewall; Reset password (VMAccess extension)
Can’t SSH (Linux)	NSG, `sshd` down, bad `authorized_keys`/permissions	IP flow verify 22; serial console login	Restart `sshd`; fix key/permissions; Reset SSH key (VMAccess extension)
Boots then reboots in a loop	Failed driver/update, kernel mismatch, disk pressure	Serial log across reboots	Boot previous kernel via serial console; roll back the update
Extension fails (`ProvisioningState/failed`)	Script error, dependency, network to download source	Read extension status message; check `/var/log/azure` or `C:\WindowsAzure\Logs`	Fix the script/dependency; remove and re-add the extension
VM unexpectedly deallocated, or “out of capacity” on start	Spot eviction, or no capacity in the size/zone	Activity Log; check eviction events	Use a different size/zone; for steady workloads avoid spot; consider capacity reservations
Performance cliff (slow disk)	Hitting disk IOPS/throughput cap or VM cap	VM/disk metrics (IOPS, throughput, credits)	Resize disk tier (e.g. Premium SSD v2), enable bursting, or resize the VM

Note the deallocate vs stop trap: stopping a VM from inside the OS leaves it allocated — you keep paying for compute. Only deallocating (portal “Stop” or az vm deallocate) releases the compute charge. A VM that “won’t start” after a stop is sometimes just hitting a capacity constraint on reallocation.
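To see which side of the stop/deallocate line a VM is on, read its power state rather than trusting the portal tile. The sketch below assumes a signed-in Azure CLI; `vm_power_state` and `billing_state` are hypothetical helper names, and the mapping covers only the three steady states discussed here.

```shell
# vm_power_state RESOURCE_GROUP VM_NAME : prints e.g. "VM running", "VM stopped", "VM deallocated".
vm_power_state() {
  az vm show -d -g "$1" -n "$2" --query powerState -o tsv
}

# billing_state POWER_STATE : translate a power state into its compute-billing meaning.
billing_state() {
  case "$1" in
    "VM deallocated") echo "compute released (disks still billed)" ;;
    "VM stopped")     echo "still allocated: compute is billed" ;;
    "VM running")     echo "running: compute is billed" ;;
    *)                echo "transitional or unknown state: $1" ;;
  esac
}

# Usage (read-only):
#   billing_state "$(vm_power_state rg-ts-lab vm-ts)"
```

A result of “VM stopped” on a machine nobody is using is the trap in action; `az vm deallocate` is what releases the compute.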

Identity & RBAC playbook: “access denied”

Authorization failures split into two questions, and conflating them wastes time. First: can the principal sign in at all (authentication)? A blocked or challenged sign-in is Microsoft Entra ID, usually Conditional Access. Second: once signed in, are they allowed this specific action (authorization)? That’s RBAC — the AuthorizationFailed / 403 on a control-plane operation. Different tools answer each.

For authentication, the Entra sign-in logs are definitive: every sign-in records success/failure, the failure reason, the IP and device, and which Conditional Access policy applied and what it required (MFA, compliant device, block). For authorization, Check access on the resource shows exactly which roles a principal has at that scope — remembering RBAC is additive and inherited down management group → subscription → resource group → resource, and a Deny assignment or Azure Policy can override an Allow.

Symptom	Likely cause	Diagnostic step	Fix
`AuthorizationFailed` / `403` on a portal action	Missing RBAC role at this scope	Resource → Access control (IAM) → Check access for the user	Assign the least-privilege role (e.g. Contributor on the RG, not Owner on the sub)
Has a role but still denied	Deny assignment, Azure Policy `deny`, or scope mismatch	Check access lists deny assignments; a policy block shows `RequestDisallowedByPolicy` in the error, so also review Policy compliance	Adjust the policy/exemption; assign the role at the correct scope
Sign-in blocked entirely	Conditional Access policy (location, device, risk)	Entra → Sign-in logs → the entry’s Conditional Access tab	Adjust the CA policy or satisfy it (trusted location, compliant device); test in report-only first
Prompted for MFA unexpectedly / can’t complete MFA	CA requiring MFA; no registered method	Sign-in logs → CA result; check Authentication methods	Register a method; scope/adjust the CA policy; check number-matching
App/script gets `401`/`403` (not a human)	Managed identity not assigned the role, or wrong identity used	Confirm the identity’s role with Check access; verify the token’s `oid`/`appid`	Grant the managed identity the role; ensure the app requests the right identity
Role assignment “done” but no effect	RBAC propagation delay, or assigned at wrong scope	Re-check access; wait a few minutes; sign out/in for a fresh token	Re-assign at correct scope; refresh the token (token caches old claims)
Guest (B2B) user can’t access	External collaboration / CA settings; not invited to resource	Sign-in logs for the guest; Check access	Fix invitation/cross-tenant settings; assign the role to the guest
PIM-eligible admin lacks access right now	Role is eligible, not active	Check PIM assignments	Activate the eligible role (PIM) for the session

A common trap: someone is “definitely an Owner” but gets AuthorizationFailed. Check access reveals they’re Owner on a different resource group, or their role is eligible via PIM and not activated, or a cached token predates the assignment. Read the evidence; don’t escalate privileges to paper over a scope mistake.
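The “Owner on a different resource group” trap is a scope-inheritance question: an assignment applies to a resource only if the assignment’s scope is that resource’s ID or a parent of it. The sketch below models that rule as a path-prefix test; it is an illustration of the inheritance logic, not how Azure evaluates access internally, and `scope_covers` is a hypothetical helper. It ignores deny assignments, policy and PIM activation.

```shell
# scope_covers ASSIGNMENT_SCOPE TARGET_RESOURCE_ID
# Exit 0 if the assignment scope equals the target or is one of its parents.
# Comparison is lowercased because resource IDs are not case-sensitive.
scope_covers() {
  sc_assigned=$(printf '%s' "${1%/}" | tr '[:upper:]' '[:lower:]')
  sc_target=$(printf '%s' "${2%/}" | tr '[:upper:]' '[:lower:]')
  case "$sc_target" in
    "$sc_assigned"|"$sc_assigned"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with real data (read-only; the assignee value is a placeholder):
#   az role assignment list --assignee "<user-or-object-id>" --all \
#     --query "[].{role:roleDefinitionName, scope:scope}" -o tsv
# then test each listed scope against the resource ID that returned AuthorizationFailed.
```

Note the trailing-slash match: a scope ending in `rg-a` does not cover a resource under `rg-ab`, which is why a plain string-prefix comparison would give the wrong answer.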

Storage playbook: “403 / authentication failed”

Storage 403s feel mysterious because a request passes four independent gates, and any one returns 403. In order: firewall (does the network allow this caller?), authorization (is the credential — account key, SAS, or Entra/RBAC — valid and sufficient?), request validity (SAS expired, clock skewed, permission missing?), and for private access, DNS (does the endpoint resolve to the private endpoint?). Diagnose in that order, because a network block and an auth failure look identical from the client.

Symptom	Likely cause	Diagnostic step	Fix
`403 AuthorizationFailure` from an allowed identity	Storage firewall blocks the caller’s network	Storage → Networking: is it “selected networks”? Is the caller’s IP/VNet listed?	Add the caller’s IP or subnet (the subnet needs a service endpoint), connect through a private endpoint, or use the trusted-services exception
`403 AuthenticationFailed` with account key	Wrong/rotated key, or clock skew on the client	Verify the key; check the client’s time is in sync	Use the current key (prefer Entra auth over keys); fix NTP
`403` from a SAS URL	SAS expired, wrong permissions (e.g. read-only used for write), or wrong signed resource/IP	Decode the SAS: check `se` (expiry), `sp` (perms), `sip`, `spr`	Reissue with correct perms/expiry; prefer user-delegation SAS (Entra-backed)
`403` doing data ops with RBAC	Has `Reader` (control plane) but not a data role	Check access: look for Storage Blob Data roles	Assign Storage Blob Data Reader/Contributor — control-plane roles don’t grant data access
Connects to the public endpoint despite a private endpoint	Private DNS not resolving the `privatelink` name	`nslookup account.blob.core.windows.net` — expect a private IP	Link the `privatelink.blob.core.windows.net` zone to the VNet; add the A record
`409`/`PublicAccessNotPermitted` on anonymous blob read	Anonymous/public access disabled (the secure default)	Check allow blob anonymous access setting	Use SAS or Entra auth; only enable anonymous if truly required
Intermittent throttling (`503`/`500`)	Hitting account scalability targets (IOPS/egress)	Storage metrics (transactions, throttled requests)	Spread load, use multiple accounts, or move to premium; back off and retry

The single most common storage 403: an identity has Reader on the account and tries to read a blob, and is denied. Reader is a control-plane role — it lets you see the account exists. Reading data needs a Storage Blob Data role. Control plane and data plane are separate authorization systems — heavily tested, heavily tripped over.
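For the SAS row of the table, “decode the SAS” means reading its query parameters. The sketch below pulls out a field and checks the expiry using only standard shell tools. It is a simplified reader, assuming the expiry is written in the full `YYYY-MM-DDTHH:MM:SSZ` form and decoding only `%3A` and `%2C`; the helper names and the example URL are hypothetical, and the signature is deliberately a placeholder.

```shell
# sas_field URL_OR_TOKEN KEY : print the value of one SAS query parameter.
# Only %3A (":") and %2C (",") are decoded, enough for se, st, sp, sip and spr.
sas_field() {
  printf '%s\n' "${1#*[?]}" | tr '&' '\n' | sed -n "s/^$2=//p" |
    sed -e 's/%3[Aa]/:/g' -e 's/%2[Cc]/,/g' | head -n 1
}

# sas_expired URL_OR_TOKEN [NOW] : exit 0 if se is earlier than NOW (default: current UTC).
# Same-format ISO 8601 UTC strings sort in time order, so a byte-wise sort is sufficient.
sas_expired() {
  sas_se=$(sas_field "$1" se)
  [ -n "$sas_se" ] || return 2   # no expiry field found
  sas_now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  sas_first=$(printf '%s\n%s\n' "$sas_se" "$sas_now" | LC_ALL=C sort | head -n 1)
  [ "$sas_first" = "$sas_se" ] && [ "$sas_se" != "$sas_now" ]
}

# Usage (hypothetical URL):
#   url='https://example.blob.core.windows.net/c/b.txt?sv=2022-11-02&sp=r&se=2020-01-01T00%3A00%3A00Z&spr=https&sig=PLACEHOLDER'
#   sas_field "$url" sp        # permissions granted
#   sas_expired "$url" && echo "SAS has expired"
```

If `sp` lacks the permission the request needs, or `se` is in the past, the 403 is a request-validity failure and no amount of firewall or RBAC work will fix it.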

App Service playbook: “the app returns 5xx”

App Service hides the VM, so debugging shifts to streaming logs, the Kudu/SCM advanced-tools site, and Application Insights. Distinguish platform 5xx from application 5xx: a 503 often means the app failed to start, was stopped, or is throttled/cold; a 500 is usually your code throwing. The fastest first move is Log stream (live stdout/stderr and web-server logs) to see the actual exception or startup error rather than guess from the status code.

Symptom	Likely cause	Diagnostic step	Fix
`503 Service Unavailable`	App failed to start, is stopped, or plan is overloaded/cold	Log stream; check the app is Running; Resource Health	Fix the startup error (logs); scale up/out; enable Always On to avoid cold idle
`500.30` / container won’t start	Bad startup command, missing dependency, wrong port	Log stream + Kudu (`/api/logs/docker`); check the listening port	Correct startup command/port; set `WEBSITES_PORT`; fix dependency
App `500` with stack trace	Application code exception	Application Insights → Failures; Log stream	Fix the code/config; add the missing app setting
Works locally, fails in Azure	Missing app setting/connection string, or runtime version mismatch	Compare Configuration app settings; check stack version	Add settings (they become env vars); pin the correct runtime version
`403`/`401` to a downstream resource	App’s managed identity lacks the role, or Key Vault reference broken	Check access on the target; check Key Vault reference status	Grant the identity the role; fix the `@Microsoft.KeyVault(...)` reference
Slow / timeouts after deploy	Cold start, under-scaled plan, or SNAT exhaustion to a DB	App Insights performance; check SNAT/connection metrics	Always On + scale; reuse connections; integrate a NAT gateway/VNet
`502 Bad Gateway` intermittently	Worker crashing/recycling, or downstream timeout	Log stream during the failure; check process restarts	Fix the crash; raise downstream timeouts; right-size the plan
Custom domain / TLS errors	Missing binding, expired cert, or DNS not pointing at the app	Check Custom domains + binding; `nslookup` the CNAME	Fix the binding/cert; correct the CNAME/A record

Tie this to deployments: many App Service 5xxs appear immediately after a deploy. The robust pattern is deployment slots with a swap — warm up the new version on a staging slot, verify, then swap into production, so a bad build never takes the live site down (and swap-back is instant rollback). Covered in Secure, zero-downtime deployments on App Service.
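The status-code triage from this playbook fits in a small lookup you can run before opening the portal. This is only a sketch of the table’s first-move guidance; `http_status` and `triage_status` are hypothetical helper names, `curl` is assumed to be installed, and the ten-second timeout is an arbitrary default.

```shell
# http_status URL [MAX_SECONDS] : print only the HTTP status code of a GET request.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time "${2:-10}" "$1"
}

# triage_status CODE : map a status code to the first diagnostic from the playbook.
triage_status() {
  case "$1" in
    500)     echo "application exception: Application Insights Failures, then Log stream" ;;
    502)     echo "worker crash or downstream timeout: Log stream during the failure" ;;
    503)     echo "app not started, stopped, or plan overloaded: Log stream, Running state, Resource Health" ;;
    401|403) echo "identity to a downstream resource: Check access, Key Vault reference status" ;;
    2??|3??) echo "no server error at this URL" ;;
    *)       echo "unmapped status $1: start with Log stream" ;;
  esac
}

# Usage (the hostname is a placeholder):
#   triage_status "$(http_status https://<app-name>.azurewebsites.net/)"
```

The lookup only picks the first diagnostic; the Log stream output, not the status code, is what tells you the actual cause.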

Hands-on lab: diagnose a deliberately broken VM

In this lab you will create a fault on purpose, then use the method to find and fix it. We’ll build a small VM, lock its NSG so SSH is blocked, diagnose the block with Network Watcher (never touching the VM), then fix it. Everything uses standard/B-series resources and is deleted at the end. Run it in Cloud Shell (Bash).

1. Set up variables and a resource group.

RG=rg-ts-lab
LOC=eastus
az group create -n $RG -l $LOC -o table

2. Create a small Linux VM (this also creates a VNet, subnet, NIC, public IP and a default NSG).

az vm create \
  -g $RG -n vm-ts \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --admin-username azureuser \
  --generate-ssh-keys \
  --public-ip-sku Standard \
  -o table

Note the publicIpAddress in the output. Confirm SSH works (answer yes to the host-key prompt, then exit):

ssh azureuser@<publicIpAddress> 'echo connected; exit'

3. Break it. Add a high-priority rule that denies inbound SSH — simulating “someone tightened the firewall and now I can’t get in”. The NSG is vm-tsNSG by default:

az network nsg rule create \
  -g $RG --nsg-name vm-tsNSG \
  -n DenySSH --priority 100 \
  --direction Inbound --access Deny \
  --protocol Tcp --destination-port-ranges 22 \
  --source-address-prefixes '*' -o table

Now retry the SSH from step 2 — it hangs and times out. Resist the urge to recreate the VM. Apply the method.

4. Isolate the layer with Network Watcher (read-only). Get the NIC ID, then ask “would an NSG allow inbound SSH?” with IP flow verify:

NIC=$(az vm show -g $RG -n vm-ts --query 'networkProfile.networkInterfaces[0].id' -o tsv)
PRIV=$(az network nic show --ids $NIC --query 'ipConfigurations[0].privateIPAddress' -o tsv)

az network watcher test-ip-flow \
  -g $RG \
  --vm vm-ts \
  --nic $NIC \
  --direction Inbound --protocol TCP \
  --local "$PRIV:22" --remote "1.2.3.4:12345" \
  -o table

Expected output: access = Deny, ruleName = ...DenySSH. In one read-only command you’ve proven the fault is the NSG (not the VM, not SSH, not your client) and named the offending rule — the whole point of the method.

5. Confirm with effective rules (optional, corroborating evidence).

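# Effective rules usually report ports as ranges ("22-22"), so match both forms.
az network nic list-effective-nsg --ids $NIC \
  --query "value[].effectiveSecurityRules[] | [?destinationPortRange=='22-22' || destinationPortRange=='22'].{name:name, access:access, priority:priority, direction:direction}" -o table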

You’ll see DenySSH sitting above the default allow — the merged, real ruleset Azure applied.

6. Fix the root cause (remove the bad rule) and verify by re-running the original reproduction:

az network nsg rule delete -g $RG --nsg-name vm-tsNSG -n DenySSH -o table
# Verify from the user's perspective — the reproduction from step 2:
ssh azureuser@<publicIpAddress> 'echo reconnected; exit'

It connects again. You diagnosed and fixed a connectivity incident without ever logging into the VM — because the evidence pointed at the network layer.

7. Prevent (discuss). In production you’d codify the working NSG in Bicep/Terraform so an ad-hoc deny can’t drift in unreviewed, and add an Activity Log alert on NSG rule changes — guards covered in the diagnostics and policy lessons.

Cleanup — delete everything so you pay nothing further:

az group delete -n $RG --yes --no-wait

Cost note. A Standard_B1s VM plus a Standard public IP runs a few US cents per hour; finish and delete within the hour and the total is a rounding error (typically under ₹10 / ~US$0.10). The Network Watcher checks are effectively free. --no-wait returns immediately while Azure deletes in the background — confirm the resource group is gone afterwards so nothing lingers on the bill.
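Because `--no-wait` returns before the deletion finishes, the confirmation step is easy to forget. A small polling helper can do it for you. This is a sketch assuming a signed-in Azure CLI; the function name and the default of 30 checks, 20 seconds apart, are arbitrary choices.

```shell
# wait_for_group_gone RESOURCE_GROUP [MAX_CHECKS] [PAUSE_SECONDS]
# Polls until the resource group no longer exists, or gives up after MAX_CHECKS.
wait_for_group_gone() {
  wg_try=1
  while [ "$wg_try" -le "${2:-30}" ]; do
    # az group exists prints true or false (JSON output requested explicitly).
    if [ "$(az group exists -n "$1" -o json)" = "false" ]; then
      echo "resource group $1 is gone"
      return 0
    fi
    sleep "${3:-20}"
    wg_try=$((wg_try + 1))
  done
  echo "resource group $1 still exists after ${2:-30} checks" >&2
  return 1
}

# Usage:
#   wait_for_group_gone "$RG"
```

A non-zero exit means the group is still present; check the portal or re-run `az group delete` without `--no-wait` to see the error.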

Common mistakes & troubleshooting

The meta-mistakes — the errors people make while troubleshooting — cost more than any single misconfiguration:

Mistake	Why it bites	Do this instead
Changing several settings at once	You can’t tell what fixed it (or what broke worse)	Change one variable, test, then the next
Fixing the symptom, not the cause	The incident recurs tomorrow	Trace to root cause; capture a prevention (step 8)
Trusting the intended config over the effective config	NSGs/routes merge; what you set ≠ what’s applied	Read effective rules/routes, sign-in logs, real DNS
Confusing authentication with authorization	You add roles when sign-in is the block (or vice versa)	Sign-in logs for auth; Check access for RBAC
Confusing control plane with data plane (storage)	`Reader` won’t read a blob; you grant the wrong role	Use Storage Blob Data roles for data ops
Forgetting RBAC propagation / token caching	“I assigned the role and it still fails”	Wait a few minutes; sign out/in for a fresh token
“Stopping” a VM in-guest and still being billed	Compute stays allocated	Deallocate (portal Stop / `az vm deallocate`) to stop compute charges
Skipping “what changed?”	You debug from scratch when a deploy caused it	Check the Activity Log first

Best practices

Lead with read-only diagnostics. IP flow verify, next hop, effective rules/routes, sign-in logs, Check access, log stream — all inspect without mutating. Exhaust them first.
Bisect to localise. Works here but not there? The fault lies in the difference. Same failure everywhere? It’s central (the resource or its config).
Keep the playbooks at hand. Match the symptom to the table, run the one diagnostic, apply the fix — don’t improvise under pressure.
Codify the good state. Infrastructure as code (Bicep/Terraform) makes “config vs desired” trivial — what-if/plan shows drift instantly, and re-applying is the fix.
Close the loop with prevention. Every incident should leave behind an alert, policy, runbook entry, or test. An incident you don’t prevent is one you’ll repeat.
Write it down. A two-line symptom → root cause → fix note is the seed of your team’s runbook and the fastest path for the next person (often future-you).

Security notes

Troubleshooting under pressure is exactly when security hygiene erodes — guard against it:

Never “fix” access by granting Owner. Over-privileging to end an incident is how standing privilege accumulates. Diagnose the real scope gap, grant the least-privilege role, and prefer PIM just-in-time activation for admin actions.
Prefer Entra auth over keys/SAS. For a storage 403, the durable fix is usually moving off account keys to Entra + RBAC (and user-delegation SAS), not regenerating a key — rotated keys break every other consumer anyway.
Don’t widen the network to make it work. Setting an NSG source to Any or exposing RDP/SSH to “test” leaves a hole. Scope to your IP, and connect via Azure Bastion so management ports need no public exposure.
Treat logs as sensitive. Sign-in, Activity and app logs hold IPs, principals and sometimes payloads. Restrict who can read them; don’t paste them into untrusted places.
Revert speculative changes. Anything you loosened to diagnose and that didn’t help must go back — left-behind diagnostic changes are a classic source of the next incident.

Interview & exam questions

1. Walk me through how you troubleshoot an Azure issue. The loop: reproduce → isolate the layer → compare config vs desired → inspect logs/metrics → hypothesise and test (one variable) → fix the root cause → verify by re-running the reproduction → prevent. Emphasise isolating the layer and read-only first.

2. A VM is unreachable on port 443. First move? Network Watcher IP flow verify for inbound TCP/443 — it returns Allow/Deny and the deciding rule. If Allow, the NSG is exonerated; check next hop (a UDR to a dead NVA?), the in-guest service/firewall, and DNS. Don’t touch the VM until evidence points there.

3. Distinguish authentication from authorization failures, and the tool for each. Authentication = can you sign in (Entra, often Conditional Access) — use sign-in logs. Authorization = are you allowed this action (RBAC, the AuthorizationFailed/403) — use Check access. Conflating them is the classic error.

4. A user with Reader on a storage account gets 403 reading a blob. Why? Reader is a control-plane role — it shows the account exists but grants no data access. Reading blobs needs a data-plane role like Storage Blob Data Reader/Contributor; the two are separate authorization systems.

5. Name the four gates a storage request passes, in order. Firewall (network allowed?) → authorization (key/SAS/RBAC valid and sufficient?) → request validity (SAS expiry/perms, clock skew) → DNS (private-endpoint resolution). Any one returns 403; diagnose in that order.

6. How do you log into a VM with no working network and no RDP/SSH? The serial console — a keyboard into the VM’s serial port via the platform, independent of NSGs/network — backed by boot diagnostics to see the boot state. For external changes, run-command runs a script with admin/root rights over the control plane.

7. Stopping vs deallocating a VM — the difference, and why it matters? Stopping in-guest leaves the VM allocated — you keep paying for compute. Deallocating (portal Stop / az vm deallocate) releases the compute charge (disks still billed). A “stopped” VM still on the bill, or a deallocated VM that won’t restart due to capacity, are common surprises.

8. App Service returns 503. How do you approach it? 503 usually means the app failed to start, is stopped, or is cold/throttled. First: Log stream to read the actual startup error; confirm it’s Running; check Resource Health and Application Insights failures. Fixes: correct the startup error, enable Always On, scale, add the missing app setting.

9. You assigned an RBAC role but the user still can’t access. Three reasons. (a) Wrong scope; (b) propagation delay / cached token with old claims (sign out/in); (c) a Deny assignment or Azure Policy overrides the Allow. Check access reveals which.

10. A private-endpoint resource is reached on its public IP — what’s wrong, and how do you confirm? The Private DNS zone (e.g. privatelink.blob.core.windows.net) isn’t linked to the VNet or lacks the A record, so the name resolves to the public IP. Confirm with nslookup of the FQDN — it should return a private IP. Fix by linking the zone and adding the record.

11. How do you safely test a Conditional Access change you suspect is blocking sign-ins? Report-only mode: the policy is evaluated and logged in the sign-in logs (you see whether it would grant/block/require MFA) without enforcing it — so you validate before turning it on.

12. What turns a junior troubleshooter into a senior one? The last step: prevention. Juniors fix the symptom; seniors trace the root cause and leave behind an alert, a policy, infrastructure-as-code, or a runbook so it can’t silently recur — and they reason by isolating the layer instead of guessing.

Quick check

In the eight-step method, which step is the “master move” that saves the most time, and what does it determine?
A storage request from an identity you know has the right role still returns 403. Name two layers that could cause this and how you’d tell them apart.
You can’t SSH to a VM. Which single Network Watcher check tells you whether an NSG is responsible — and what extra information does it give you?
A user “is an Owner” but gets AuthorizationFailed. Give two plausible explanations.
True or false: stopping a VM from inside its operating system stops you being billed for compute. Explain.

Answers

Isolate the layer (step 2). It determines which layer is failing — identity, network, DNS, resource, or app — so you fix the right thing instead of the first thing.
The firewall (caller’s network not allowed) and authorization (credential/role insufficient — e.g. a control-plane role used for a data-plane op). Tell them apart via Networking on the account (is the caller’s IP/VNet allowed?) versus Check access for the right Storage Blob Data role; a private-endpoint DNS miss is a third possibility.
IP flow verify — it returns Allow/Deny for the exact 5-tuple and names the deciding rule, confirming whether the NSG is the cause and which rule to change.
Owner at a different scope (another resource group); role is eligible via PIM, not activated; or a cached token predates the assignment (sign out/in). A Deny assignment/Policy could also override it.
False. Stopping in-guest leaves the VM allocated — compute charges continue. You must deallocate (portal “Stop” or az vm deallocate) to release the compute charge (disks remain billed).
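The last answer can be turned into a small lookup for auditing an inventory. This is a minimal sketch: the power state strings follow the codes a VM reports in its instance view, while the helper names and VM names are hypothetical. How you collect the states (portal, CLI, or SDK) is left out.

```python
# Whether compute is charged in each power state.
# Disks are billed in every state, including deallocated.
COMPUTE_BILLED = {
    "PowerState/starting": True,
    "PowerState/running": True,
    "PowerState/stopping": True,
    "PowerState/stopped": True,        # shut down in-guest: still allocated
    "PowerState/deallocating": False,
    "PowerState/deallocated": False,   # compute released
}


def compute_is_billed(power_state: str) -> bool:
    """Look up whether a power state incurs compute charges."""
    try:
        return COMPUTE_BILLED[power_state]
    except KeyError:
        raise ValueError(f"unrecognised power state: {power_state!r}") from None


def find_stopped_but_billed(vms: dict[str, str]) -> list[str]:
    """Return the names of VMs that look 'off' but are still allocated."""
    return sorted(name for name, state in vms.items()
                  if state == "PowerState/stopped")
```

Running `find_stopped_but_billed` over a name-to-state mapping surfaces exactly the "stopped VM still on the bill" surprise; each hit is a candidate for az vm deallocate.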

Exercise

You’re handed this incident cold: “Our internal API on api.internal.contoso.com, hosted on a VM behind a private endpoint, started returning errors at 09:10. Users get ‘connection timed out’. The app team swears nothing changed.”

Work it with the method and write down, for each step, what you would do and why:

Reproduce — what exactly do you try, and from where (inside the VNet vs outside)?
Isolate the layer — walk the five layers; for this symptom (“connection timed out” to a private-endpoint name), which layers are most likely and which can you quickly rule out?
Config vs desired / what changed — where do you look to test “nothing changed”?
Inspect — name the two or three read-only diagnostics you’d run (hint: one resolves the name, one checks the route/NSG) and what each result would tell you.
Hypothesise & test, fix, verify, prevent — state your single most likely hypothesis, the one test that confirms it, the fix, how you’d verify from the user’s perspective, and the prevention you’d leave behind.

A strong answer recognises that “timed out” to a private name points hardest at DNS (resolving to the wrong/public IP — nslookup, expect a private IP, fix the Private DNS zone link) or the network (NSG/UDR — check effective rules and next hop); that the Activity Log is where you verify “nothing changed”; and that the prevention is codifying the Private DNS zone link plus an alert on changes to it.
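The DNS check in that answer can be scripted when nslookup output is awkward to parse. The sketch below uses only the Python standard library; the function names are hypothetical, and it only gives a meaningful result when run from inside the VNet, because that is the resolver the private endpoint depends on. Note that `is_private` also counts loopback and link-local ranges as private, which is a simplification.

```python
import ipaddress
import socket


def resolve(fqdn: str) -> list[str]:
    """Resolve a name with the local resolver and return unique IP strings."""
    infos = socket.getaddrinfo(fqdn, None)
    return sorted({info[4][0] for info in infos})


def classify_addresses(addresses: list[str]) -> dict[str, str]:
    """Label each resolved address as 'private' or 'public'."""
    return {
        a: "private" if ipaddress.ip_address(a).is_private else "public"
        for a in addresses
    }


def private_endpoint_dns_ok(addresses: list[str]) -> bool:
    """True only if the name resolved and every address is private."""
    labels = classify_addresses(addresses)
    return bool(labels) and all(v == "private" for v in labels.values())
```

In the exercise you would call `private_endpoint_dns_ok(resolve("api.internal.contoso.com"))` from a machine in the VNet. A `False` result points at the Private DNS zone link or a missing A record; a `True` result moves suspicion to the NSG or route layer.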

Certification mapping

This lesson maps to AZ-104: Microsoft Azure Administrator, chiefly the Monitor and maintain Azure resources domain:

Troubleshoot connectivity — NSG effective rules, effective routes, next hop, IP flow verify, DNS/private-endpoint resolution.
Troubleshoot virtual machines — boot diagnostics, serial console, run-command, the VM agent/extensions, deallocate vs stop.
Manage and troubleshoot identity/RBAC — AuthorizationFailed, Check access, scope and inheritance, sign-in logs and Conditional Access (overlaps SC-300).
Troubleshoot storage — firewall vs authorization vs SAS vs DNS; control-plane vs data-plane roles.
Troubleshoot App Service — log stream, Kudu, Application Insights, slots/swap (overlaps AZ-204).

The companion diagnostics toolkit lesson drills the specific tools the exam expects you to name. The method here is also what AZ-305 and real architecture interviews probe when they ask how you’d approach an unfamiliar failure.

Glossary

Reproduce — making a fault occur on demand, so you can confirm both the cause and, later, the fix.
Isolate the layer — determining which layer (identity, network, DNS, resource, application) is failing before changing anything.
Effective security rules / effective routes — the merged, applied NSG ruleset (subnet + NIC) and route table (system + BGP + UDR) on a NIC, versus what you configured in isolation.
IP flow verify — a Network Watcher check that says whether a 5-tuple would be allowed/denied by NSGs, and names the deciding rule.
Next hop — a Network Watcher check showing where a packet to a destination actually goes (internet, VNet, virtual appliance, none).
Boot diagnostics — a platform-captured screenshot and serial log of a VM as it boots, visible without logging in.
Serial console — a keyboard into a VM’s serial port via the platform, working without network/RDP/SSH.
Run-command — running a script as admin/root on a VM over the control plane, no SSH/RDP required.
Control plane vs data plane — managing a resource (Reader sees the account) versus accessing its data (Storage Blob Data Reader reads blobs); separate authorization systems.
Authentication vs authorization — proving who you are (sign-in; Conditional Access) versus what you’re allowed to do (RBAC; AuthorizationFailed).
Check access — the IAM feature showing a principal’s effective roles at a given scope.
Deallocate vs stop — releasing a VM’s compute (no compute charge) versus a stopped-but-allocated VM (still billed for compute).
Activity Log — the per-subscription audit of control-plane writes: who changed/deleted what, and when.
Prevention — the alert, policy, IaC guard, or runbook left behind so an incident can’t silently recur.
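To make the IP flow verify entry concrete, the sketch below imitates the idea behind it: rules are evaluated in priority order (lowest number first) and the first match decides. It is a simplified local model, not the Network Watcher API. The `Rule` class and `simulate_flow` function are hypothetical, and it considers only inbound traffic on one NSG, matching on protocol, source address and destination port rather than the full 5-tuple.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Simplified inbound NSG rule (hypothetical model)."""
    name: str
    priority: int   # lower number is evaluated first
    access: str     # "Allow" or "Deny"
    protocol: str   # "Tcp", "Udp" or "*"
    source: str     # CIDR range or "*"
    dest_port: str  # single port or "*"


def _matches(rule: Rule, protocol: str, source_ip: str, dest_port: int) -> bool:
    if rule.protocol != "*" and rule.protocol.lower() != protocol.lower():
        return False
    if rule.source != "*":
        network = ipaddress.ip_network(rule.source)
        if ipaddress.ip_address(source_ip) not in network:
            return False
    return rule.dest_port in ("*", str(dest_port))


def simulate_flow(rules, protocol: str, source_ip: str, dest_port: int):
    """Return (access, rule_name) for the first matching rule by priority."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if _matches(rule, protocol, source_ip, dest_port):
            return rule.access, rule.name
    return "Deny", "no matching rule"
```

With an allow rule for SSH from 203.0.113.0/24 at priority 300 and the default DenyAllInBound at 65500, a flow from 198.51.100.7 to port 22 returns `("Deny", "DenyAllInBound")`. Naming the deciding rule is what makes the real check useful.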

Next steps

You now have a method that works on any Azure failure and playbooks for the domains that break most often. The natural next move is to go deep on the tools the playbooks reference and learn to wield them fluently:

Next lesson: The Azure Diagnostics Toolkit: Network Watcher, Resource Health, Boot Diagnostics & KQL — every diagnostic here, hands-on, plus a teachable KQL query set for Log Analytics.