
Azure Troubleshooting Playbooks: Network, VM, Identity, Storage & Apps

There is a particular kind of panic that hits when something in Azure breaks in production. A web app starts returning 500s; a virtual machine you could reach yesterday refuses RDP; a perfectly valid storage request comes back 403; a colleague who definitely has access is told they don’t. The temptation is to start clicking — toggle a setting, restart the VM, regenerate a key, add an Owner role assignment — and hope. That is gambling, not troubleshooting, and it is how a five-minute incident becomes a two-hour one with three new misconfigurations layered on top of the original fault.

This lesson teaches the opposite habit: a repeatable method that turns “it’s broken and I don’t know why” into a short, ordered set of questions that converge on the real cause — plus per-domain playbooks (networking, VM, identity, storage, App Service) mapping a symptom to its likely cause, the one diagnostic that confirms it, and the fix. A senior engineer is not someone who memorises every error code; it is someone who can take an unfamiliar failure and narrow it down calmly. By the end you will have that instinct, plus a reference to keep open during an incident.

This is a methodology lesson — it teaches how to think. The next lesson, The Azure Diagnostics Toolkit, teaches the tools (Network Watcher, Resource Health, boot diagnostics, serial console, KQL) hands-on. Here we cover enough of each tool to act; there we go deep.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should already understand Azure’s core building blocks: subscriptions and resource groups, virtual networks and NSGs, virtual machines, Microsoft Entra ID and RBAC, and storage accounts — all covered in earlier lessons. You needn’t be an expert in any; troubleshooting is precisely the skill of reasoning about a system you only partly understand. This is Lesson G7, the first of two in the Operations & Troubleshooting module. It pairs with Azure Service Health, Advisor & Resource Graph (is it me or is it Azure?) and leads into the diagnostics-tools deep dive. Everything here maps to AZ-104, where “monitor and troubleshoot Azure resources” is an exam domain.

The troubleshooting mindset: eight steps that always work

Tools change; the method does not. Whether you are debugging a 2003-era on-prem server or a 2026 Azure landing zone, the same loop applies. Internalise these eight steps and you will never again be the person frantically toggling settings.

# | Step | The question it answers | Why it matters
1 | Reproduce | Can I make it fail on demand? | A fault you can’t reproduce, you can’t confirm you’ve fixed. Pin down exactly who, what, where, when.
2 | Isolate the layer | Which layer is actually failing: identity, network, the resource, or the app? | This is the master move. Most wasted time comes from fixing the wrong layer.
3 | Config vs desired | Does the current configuration match what I intended? | Most Azure incidents are a config drift or a recent change, not a platform fault.
4 | Inspect logs & metrics | What does the evidence say? | Logs and metrics are ground truth. Read them before theorising, not after.
5 | Hypothesise & test | What’s my single best guess, and what one test confirms or kills it? | One variable at a time. A test that can only “succeed” proves nothing.
6 | Fix | What is the smallest change that addresses the root cause? | Fix the cause, not the symptom; change one thing so you know what worked.
7 | Verify | Is it actually fixed, from the user’s perspective? | Re-run the reproduction from step 1. “It looks fine in the portal” is not verification.
8 | Prevent | How do I make sure this never silently recurs? | Turn the fix into an alert, a policy, a runbook, or a test. This is what makes you senior.

A few principles make the loop sharper:

Isolating the layer — the master skill

Step 2 deserves its own model because it is where time is won or lost. Almost every Azure failure lives in one of five layers. Ask the questions top to bottom and you will usually localise the fault in under a minute:

Layer | “Is the problem here?” (quick test) | Typical symptoms
Identity / authorization | Can the caller authenticate, and do they have the role for this action? | AuthorizationFailed, 403 on the control plane, “you don’t have access”, sign-in blocked
Network / connectivity | Can the packet physically reach the target on the right port? | Timeouts, “connection refused”, can’t RDP/SSH, intermittent drops
DNS / name resolution | Does the name resolve to the IP you expect? | “host not found”, connecting to the public IP of a private resource
The resource / platform | Is the resource itself healthy and running? | VM won’t boot, service degraded, 503 from a stopped backend
The application | Is the code/config inside the resource the problem? | App-level 500, stack traces, bad connection string, failed dependency

The trick is to test the cheapest, most likely layer first and to bisect: if a request fails from one machine but succeeds from another, the fault lies in whatever differs between them (subnet, NSG, route, DNS or identity). If the same request fails everywhere, the problem is central (the resource or its config), not the caller.

[Figure: Azure troubleshooting decision tree]

The decision tree above is the same logic rendered as a flowchart: start from the symptom, ask “can it authenticate?”, then “can the packet arrive?”, then “does the name resolve?”, then “is the resource healthy?”, and finally “is it the app?” — branching to the matching playbook below at each point.
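The DNS and network questions can often be asked from the client before opening the portal. The following is a rough sketch under stated assumptions: the function name probe_layer and the five-second timeout are illustrative, and it relies on Bash's /dev/tcp redirection plus the getent and timeout utilities found on most Linux hosts.

```shell
# probe_layer HOST PORT: hypothetical helper, not an Azure tool.
# Prints which layer the client-side evidence points at.
probe_layer() {
  local host=$1 port=$2

  # DNS layer: does the name resolve at all? (IP literals pass through.)
  if ! getent ahosts "$host" >/dev/null 2>&1; then
    echo "dns"
    return
  fi

  # Network layer: attempt a TCP connection, giving up after 5 seconds.
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  case $? in
    0)   echo "reachable" ;;  # something answered; look at the resource or app next
    124) echo "timeout" ;;    # silent drop: suspect an NSG, UDR or firewall
    *)   echo "rejected" ;;   # fast failure: suspect nothing listening on the port
  esac
}

# Example: probe_layer api.internal.contoso.com 443
```

The result only narrows the search. A "timeout" still needs IP flow verify or next hop to name the rule or route responsible.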

Networking playbook: “I can’t connect to it”

Connectivity is the single biggest source of Azure incidents, and almost all come down to four things: an NSG (network security group) rule blocking traffic, a UDR (user-defined route) sending the packet somewhere unexpected, DNS resolving the wrong address, or the target service not listening. Beginners stare at the VM; the fix is to inspect the effective configuration Azure actually applied, because NSGs combine at subnet and NIC level and routes combine system + BGP + UDR.

Three Network Watcher checks resolve most cases. IP flow verify answers “would an NSG allow this exact flow (direction, protocol, local IP and port, remote IP and port)?” and names the deciding rule. Next hop answers “where does a packet to this destination actually go: internet, VNet, a virtual appliance, or a black hole?”. Effective security rules and effective routes (on the NIC) show the merged, real ruleset rather than what you think you configured.

Symptom | Likely cause | Diagnostic step | Fix
Connection times out (no response) | NSG denying inbound on the port | Network Watcher → IP flow verify for the dest port; or read effective security rules on the NIC | Add/adjust an inbound Allow rule at the right priority (lower number wins); narrow source to your IP, not Any
Connection refused (fast reject) | The service isn’t listening / wrong port / firewall inside the OS | Confirm the app is bound (ss -tlnp / netstat); check the OS firewall | Start the service / bind the right port; open the host firewall (Windows Firewall, ufw)
Traffic silently disappears | A UDR forces 0.0.0.0/0 to a firewall/NVA that drops or isn’t up | Network Watcher → Next hop for the destination | Fix the route table, ensure the NVA/Azure Firewall is healthy, or correct the next-hop type
Reaches the wrong server, or “host not found” | DNS resolving to a stale/public IP | nslookup/dig the name from the client; check the VNet’s DNS servers and any Private DNS zone | Point the VNet at the right DNS; fix the A record / Private DNS zone link
Works from one subnet, fails from another | NSG/route differs, or peering lacks AllowForwardedTraffic/gateway transit | Compare effective rules/routes on both NICs; check peering settings | Align the NSG/UDR; enable the required peering options
Can’t reach a PaaS resource over private endpoint | Private DNS not resolving the privatelink name; still using public IP | nslookup the resource FQDN; it should return a private IP | Link the privatelink.* Private DNS zone to the VNet; add the A record
Intermittent outbound failures under load | SNAT port exhaustion on the outbound path | Check Load Balancer/NAT gateway SNAT metrics | Add a NAT gateway for deterministic egress; reuse connections in the app
Can’t RDP/SSH at all from the internet | Management port exposed-then-blocked, or you should use Bastion | IP flow verify on 3389/22; check for a deny rule | Connect via Azure Bastion (no public IP needed); see the Bastion lesson

A grounding example: a VM is unreachable on port 443. IP flow verify returns Allow, rule AllowHTTPS — so the NSG is not the problem; you’ve eliminated a whole layer in one command. Next hop returns VirtualAppliance pointing at a firewall that’s showing unhealthy. There’s your cause, found in two read-only checks, no guessing.
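The second check in that example, plus the merged route table behind it, could be run from the CLI roughly as follows. This is a sketch: the wrapper names next_hop and effective_routes are hypothetical, and the resource names and IP addresses are placeholders you would supply.

```shell
# Hypothetical wrappers around two read-only Azure CLI commands.

# Where does a packet from the VM's SOURCE_IP to DEST_IP go next?
next_hop() {             # usage: next_hop RG VM SOURCE_IP DEST_IP
  az network watcher show-next-hop \
    -g "$1" --vm "$2" --source-ip "$3" --dest-ip "$4" -o table
}

# The merged route table (system + BGP + UDR) that Azure applied to a NIC.
effective_routes() {     # usage: effective_routes RG NIC_NAME
  az network nic show-effective-route-table -g "$1" -n "$2" -o table
}

# Example: next_hop rg-ts-lab vm-ts 10.0.0.4 8.8.8.8
```

Both commands only read state, so they are safe to run in the middle of an incident. The VM needs to be running for the effective views to be populated.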

Virtual machine playbook: “it won’t boot / I can’t log in”

When a VM misbehaves, separate two very different failures: the VM won’t start (a platform/OS-boot problem) versus the VM is running but you can’t connect (almost always network or in-guest config — see the networking playbook for the connectivity half). Two tools are your superpowers: boot diagnostics (a platform-captured screenshot and serial log of the boot, so you can see a kernel panic, a Windows recovery screen or a stuck fsck without logging in) and the serial console (a keyboard into the VM’s serial port via the platform, working with no network, NSG path or RDP/SSH). For changes from outside the OS, run-command runs a script as administrator/root over the control plane — no SSH required.

Symptom | Likely cause | Diagnostic step | Fix
VM stuck / won’t boot | OS corruption, bad fstab, failed update, full OS disk | Boot diagnostics screenshot + serial log | Serial console to edit fstab/grub; or attach OS disk to a rescue VM and repair
Provisioning/agent timeout, VM “running” but unmanageable | Azure VM Agent stopped or unhealthy | Check VM Agent status; review extension provisioning state | Restart the agent via the serial console (run-command itself depends on the agent); reinstall the agent
Can’t RDP (Windows) | NSG, or RDP service/firewall in-guest, or expired credentials | IP flow verify 3389; boot diagnostics for a logon screen; run-command | Fix NSG; via run-command re-enable RDP / reset firewall; Reset password (VMAccess extension)
Can’t SSH (Linux) | NSG, sshd down, bad authorized_keys/permissions | IP flow verify 22; serial console login | Restart sshd; fix key/permissions; Reset SSH key (VMAccess extension)
Boots then reboots in a loop | Failed driver/update, kernel mismatch, disk pressure | Serial log across reboots | Boot previous kernel via serial console; roll back the update
Extension fails (ProvisioningState/failed) | Script error, dependency, network to download source | Read extension status message; check /var/log/azure or C:\WindowsAzure\Logs | Fix the script/dependency; remove and re-add the extension
VM unexpectedly deallocated, or “out of capacity” on start | Spot eviction, or no capacity in the size/zone | Activity Log; check eviction events | Use a different size/zone; for steady workloads avoid spot; consider capacity reservations
Performance cliff (slow disk) | Hitting disk IOPS/throughput cap or VM cap | VM/disk metrics (IOPS, throughput, credits) | Resize disk tier (e.g. Premium SSD v2), enable bursting, or resize the VM
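Two of the diagnostics in the table can be reached from the CLI without any network path into the guest. A minimal sketch, assuming boot diagnostics is already enabled on the VM and the VM agent is healthy (run-command is delivered through the agent); the wrapper names boot_log and run_in_guest are hypothetical.

```shell
# Hypothetical wrappers; RG and VM are placeholders you supply.

# Serial log captured by the platform during boot (kernel panic, fsck, etc.).
boot_log() {             # usage: boot_log RG VM
  az vm boot-diagnostics get-boot-log -g "$1" -n "$2"
}

# Run a shell command inside a Linux guest over the control plane.
run_in_guest() {         # usage: run_in_guest RG VM 'shell command'
  az vm run-command invoke -g "$1" -n "$2" \
    --command-id RunShellScript --scripts "$3"
}

# Example (Ubuntu names the service "ssh"; other distros may use "sshd"):
#   run_in_guest rg-ts-lab vm-ts 'systemctl is-active ssh'
```

For a Windows guest the equivalent command ID would be RunPowerShellScript.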

Note the deallocate vs stop trap: shutting a VM down from inside the OS (or with az vm stop) powers it off but leaves it allocated, so you keep paying for compute. Only deallocating (portal “Stop” or az vm deallocate) releases the compute charge. A deallocated VM that “won’t start” is sometimes just hitting a capacity constraint on reallocation.
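To check which side of the trap a VM is on, its power state can be read and interpreted. A small sketch: compute_billed is a hypothetical helper that covers only the three steady states discussed here and reports anything else as unknown.

```shell
# compute_billed STATE: map a power state string to "is compute still billed?".
# Hypothetical helper; transitional states are deliberately left as "unknown".
compute_billed() {
  case "$1" in
    "VM deallocated")          echo "no" ;;
    "VM running"|"VM stopped") echo "yes" ;;
    *)                         echo "unknown" ;;
  esac
}

# Typical use, with RG and the VM name as placeholders:
#   state=$(az vm show -d -g "$RG" -n vm-ts --query powerState -o tsv)
#   compute_billed "$state"
```

"VM stopped" is the state left behind by an in-guest shutdown, which is why it maps to "yes".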

Identity & RBAC playbook: “access denied”

Authorization failures split into two questions, and conflating them wastes time. First: can the principal sign in at all (authentication)? A blocked or challenged sign-in is Microsoft Entra ID, usually Conditional Access. Second: once signed in, are they allowed this specific action (authorization)? That’s RBAC — the AuthorizationFailed / 403 on a control-plane operation. Different tools answer each.

For authentication, the Entra sign-in logs are definitive: every sign-in records success/failure, the failure reason, the IP and device, and which Conditional Access policy applied and what it required (MFA, compliant device, block). For authorization, Check access on the resource shows exactly which roles a principal has at that scope — remembering RBAC is additive and inherited down management group → subscription → resource group → resource, and a Deny assignment or Azure Policy can override an Allow.

Symptom | Likely cause | Diagnostic step | Fix
AuthorizationFailed / 403 on a portal action | Missing RBAC role at this scope | Resource → Access control (IAM) → Check access for the user | Assign the least-privilege role (e.g. Contributor on the RG, not Owner on the sub)
Has a role but still denied | Deny assignment, Azure Policy deny, or scope mismatch | Check access shows the deny; review Policy compliance | Adjust the policy/exemption; assign the role at the correct scope
Sign-in blocked entirely | Conditional Access policy (location, device, risk) | Entra → Sign-in logs → the entry’s Conditional Access tab | Adjust the CA policy or satisfy it (trusted location, compliant device); test in report-only first
Prompted for MFA unexpectedly / can’t complete MFA | CA requiring MFA; no registered method | Sign-in logs → CA result; check Authentication methods | Register a method; scope/adjust the CA policy; check number-matching
App/script gets 401/403 (not a human) | Managed identity not assigned the role, or wrong identity used | Confirm the identity’s role with Check access; verify the token’s oid/appid | Grant the managed identity the role; ensure the app requests the right identity
Role assignment “done” but no effect | RBAC propagation delay, or assigned at wrong scope | Re-check access; wait a few minutes; sign out/in for a fresh token | Re-assign at correct scope; refresh the token (token caches old claims)
Guest (B2B) user can’t access | External collaboration / CA settings; not invited to resource | Sign-in logs for the guest; Check access | Fix invitation/cross-tenant settings; assign the role to the guest
PIM-eligible admin lacks access right now | Role is eligible, not active | Check PIM assignments | Activate the eligible role (PIM) for the session

A common trap: someone is “definitely an Owner” but gets AuthorizationFailed. Check access reveals they’re Owner on a different resource group, or their role is eligible via PIM and not activated, or a cached token predates the assignment. Read the evidence; don’t escalate privileges to paper over a scope mistake.
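The scope question in that trap can be answered from the CLI as well as from Check access. A sketch, with roles_at_scope as a hypothetical wrapper name and the assignee and scope as placeholders:

```shell
# roles_at_scope ASSIGNEE SCOPE: list the role assignments that apply to a
# principal at a scope, including those inherited from parent scopes and
# those granted through group membership. Hypothetical wrapper.
roles_at_scope() {
  az role assignment list --assignee "$1" --scope "$2" \
    --include-inherited --include-groups \
    --query "[].{role:roleDefinitionName, scope:scope}" -o table
}

# Example:
#   roles_at_scope user@contoso.com \
#     "/subscriptions/<subscription-id>/resourceGroups/rg-ts-lab"
```

The scope column is the point: an Owner row whose scope is a different resource group explains the denial. This lists active assignments, so a PIM-eligible role that has not been activated would not be expected to appear here.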

Storage playbook: “403 / authentication failed”

Storage 403s feel mysterious because a request passes four independent gates, and a failure at any one of them can surface as a 403. In order: firewall (does the network allow this caller?), authorization (is the credential, whether account key, SAS, or Entra/RBAC, valid and sufficient?), request validity (SAS expired, clock skewed, permission missing?), and for private access, DNS (does the endpoint resolve to the private endpoint?). Diagnose in that order, because a network block and an auth failure look identical from the client.

Symptom | Likely cause | Diagnostic step | Fix
403 AuthorizationFailure from an allowed identity | Storage firewall blocks the caller’s network | Storage → Networking: is it “selected networks”? Is the caller’s IP/VNet listed? | Add the IP/subnet (service or private endpoint), or use a trusted-services exception
403 AuthenticationFailed with account key | Wrong/rotated key, or clock skew on the client | Verify the key; check the client’s time is in sync | Use the current key (prefer Entra auth over keys); fix NTP
403 from a SAS URL | SAS expired, wrong permissions (e.g. read-only used for write), or wrong signed resource/IP | Decode the SAS: check se (expiry), sp (perms), sip, spr | Reissue with correct perms/expiry; prefer user-delegation SAS (Entra-backed)
403 doing data ops with RBAC | Has Reader (control plane) but not a data role | Check access: look for Storage Blob Data roles | Assign Storage Blob Data Reader/Contributor; control-plane roles don’t grant data access
Connects to the public endpoint despite a private endpoint | Private DNS not resolving the privatelink name | nslookup account.blob.core.windows.net (expect a private IP) | Link the privatelink.blob.core.windows.net zone to the VNet; add the A record
409/PublicAccessNotPermitted on anonymous blob read | Anonymous/public access disabled (the secure default) | Check allow blob anonymous access setting | Use SAS or Entra auth; only enable anonymous if truly required
Intermittent throttling (503/500) | Hitting account scalability targets (IOPS/egress) | Storage metrics (transactions, throttled requests) | Spread load, use multiple accounts, or move to premium; back off and retry
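For the SAS row above, "decode the SAS" can be done locally, since the fields are plain query-string parameters. A minimal sketch in Bash: sas_fields and sas_not_expired are hypothetical helpers, the URL in the example is made up, and the expiry check assumes se is written in the full UTC form YYYY-MM-DDTHH:MM:SSZ.

```shell
# sas_fields URL: print the SAS parameters named in the table (se, sp, sip, spr).
# It only parses the query string and never contacts Azure.
sas_fields() {
  local query=${1#*\?}                 # keep only the part after the first '?'
  local -a pairs
  local pair key value
  IFS='&' read -r -a pairs <<< "$query"
  for pair in "${pairs[@]}"; do
    key=${pair%%=*}
    value=${pair#*=}
    case "$key" in
      se|sp|sip|spr)
        # URL-decode (%3A becomes ':') so the timestamp is readable.
        printf '%s=%b\n' "$key" "${value//%/\\x}"
        ;;
    esac
  done
}

# sas_not_expired URL [NOW]: succeed if se is later than NOW.
# NOW defaults to the current UTC time; timestamps in this fixed format
# sort correctly as plain text, so a string comparison is enough.
sas_not_expired() {
  local se now
  se=$(sas_fields "$1" | sed -n 's/^se=//p')
  now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  [[ -n $se && $se > $now ]]
}

# Example:
#   sas_fields 'https://example.blob.core.windows.net/c/b.txt?se=2030-01-01T00%3A00%3A00Z&sp=r&sig=...'
```

Take care not to paste a live SAS into shared terminals or tickets; the sig parameter is the credential.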

The single most common storage 403: an identity has Reader on the account and tries to read a blob, and is denied. Reader is a control-plane role — it lets you see the account exists. Reading data needs a Storage Blob Data role. Control plane and data plane are separate authorization systems — heavily tested, heavily tripped over.

App Service playbook: “the app returns 5xx”

App Service hides the VM, so debugging shifts to streaming logs, the Kudu/SCM advanced-tools site, and Application Insights. Distinguish platform 5xx from application 5xx: a 503 often means the app failed to start, was stopped, or is throttled/cold; a 500 is usually your code throwing. The fastest first move is Log stream (live stdout/stderr and web-server logs) to see the actual exception or startup error rather than guess from the status code.

Symptom | Likely cause | Diagnostic step | Fix
503 Service Unavailable | App failed to start, is stopped, or plan is overloaded/cold | Log stream; check the app is Running; Resource Health | Fix the startup error (logs); scale up/out; enable Always On to avoid cold idle
500.30 / container won’t start | Bad startup command, missing dependency, wrong port | Log stream + Kudu (/api/logs/docker); check the listening port | Correct startup command/port; set WEBSITES_PORT; fix dependency
App 500 with stack trace | Application code exception | Application Insights → Failures; Log stream | Fix the code/config; add the missing app setting
Works locally, fails in Azure | Missing app setting/connection string, or runtime version mismatch | Compare Configuration app settings; check stack version | Add settings (they become env vars); pin the correct runtime version
403/401 to a downstream resource | App’s managed identity lacks the role, or Key Vault reference broken | Check access on the target; check Key Vault reference status | Grant the identity the role; fix the @Microsoft.KeyVault(...) reference
Slow / timeouts after deploy | Cold start, under-scaled plan, or SNAT exhaustion to a DB | App Insights performance; check SNAT/connection metrics | Always On + scale; reuse connections; integrate a NAT gateway/VNet
502 Bad Gateway intermittently | Worker crashing/recycling, or downstream timeout | Log stream during the failure; check process restarts | Fix the crash; raise downstream timeouts; right-size the plan
Custom domain / TLS errors | Missing binding, expired cert, or DNS not pointing at the app | Check Custom domains + binding; nslookup the CNAME | Fix the binding/cert; correct the CNAME/A record
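The first three diagnostics in the table have CLI equivalents, which helps when the portal is slow or you are already in a shell. A sketch with hypothetical wrapper names (app_state, app_log_stream, app_setting_names); the resource group and app name are placeholders.

```shell
# Hypothetical wrappers around read-only App Service CLI commands.

# Is the app Running or Stopped?
app_state() {            # usage: app_state RG APP
  az webapp show -g "$1" -n "$2" --query state -o tsv
}

# Live log stream; press Ctrl+C to stop following.
app_log_stream() {       # usage: app_log_stream RG APP
  az webapp log tail -g "$1" -n "$2"
}

# Names of the configured app settings, for comparison with what the app expects.
app_setting_names() {    # usage: app_setting_names RG APP
  az webapp config appsettings list -g "$1" -n "$2" --query "[].name" -o tsv
}
```

Listing only the setting names keeps secrets out of the terminal scrollback while still showing whether an expected setting is missing.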

Tie this to deployments: many App Service 5xxs appear immediately after a deploy. The robust pattern is deployment slots with a swap — warm up the new version on a staging slot, verify, then swap into production, so a bad build never takes the live site down (and swap-back is instant rollback). Covered in Secure, zero-downtime deployments on App Service.

Hands-on lab: diagnose a deliberately broken VM

In this lab you will create a fault on purpose, then use the method to find and fix it. We’ll build a small VM, lock its NSG so SSH is blocked, diagnose the block with Network Watcher (never touching the VM), then fix it. Everything uses standard/B-series resources and is deleted at the end. Run it in Cloud Shell (Bash).

1. Set up variables and a resource group.

RG=rg-ts-lab
LOC=eastus
az group create -n $RG -l $LOC -o table

2. Create a small Linux VM (this also creates a VNet, subnet, NIC, public IP and a default NSG).

az vm create \
  -g $RG -n vm-ts \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --admin-username azureuser \
  --generate-ssh-keys \
  --public-ip-sku Standard \
  -o table

Note the publicIpAddress in the output. Confirm SSH works (answer yes to the host-key prompt, then exit):

ssh azureuser@<publicIpAddress> 'echo connected; exit'

3. Break it. Add a high-priority rule that denies inbound SSH — simulating “someone tightened the firewall and now I can’t get in”. The NSG is vm-tsNSG by default:

az network nsg rule create \
  -g $RG --nsg-name vm-tsNSG \
  -n DenySSH --priority 100 \
  --direction Inbound --access Deny \
  --protocol Tcp --destination-port-ranges 22 \
  --source-address-prefixes '*' -o table

Now retry the SSH from step 2 — it hangs and times out. Resist the urge to recreate the VM. Apply the method.

4. Isolate the layer with Network Watcher (read-only). Get the NIC ID, then ask “would an NSG allow inbound SSH?” with IP flow verify:

NIC=$(az vm show -g $RG -n vm-ts --query 'networkProfile.networkInterfaces[0].id' -o tsv)
PRIV=$(az network nic show --ids $NIC --query 'ipConfigurations[0].privateIPAddress' -o tsv)

az network watcher test-ip-flow \
  -g $RG \
  --vm vm-ts --nic $NIC \
  --direction Inbound --protocol TCP \
  --local "$PRIV:22" --remote "1.2.3.4:12345" \
  -o table

Expected output: access = Deny, ruleName = ...DenySSH. In one read-only command you’ve proven the fault is the NSG (not the VM, not SSH, not your client) and named the offending rule — the whole point of the method.

5. Confirm with effective rules (optional, corroborating evidence).

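# List every effective rule with its priority. The effective view can report a single port as a range such as 22-22, so an exact match on '22' may return nothing.
az network nic list-effective-nsg --ids $NIC \
  --query "value[].effectiveSecurityRules[].{name:name, priority:priority, direction:direction, access:access, port:destinationPortRange}" \
  -o table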

You’ll see DenySSH sitting above the default allow — the merged, real ruleset Azure applied.

6. Fix the root cause (remove the bad rule) and verify by re-running the original reproduction:

az network nsg rule delete -g $RG --nsg-name vm-tsNSG -n DenySSH -o table
# Verify from the user's perspective — the reproduction from step 2:
ssh azureuser@<publicIpAddress> 'echo reconnected; exit'

It connects again. You diagnosed and fixed a connectivity incident without ever logging into the VM — because the evidence pointed at the network layer.

7. Prevent (discuss). In production you’d codify the working NSG in Bicep/Terraform so an ad-hoc deny can’t drift in unreviewed, and add an Activity Log alert on NSG rule changes — guards covered in the diagnostics and policy lessons.

Cleanup — delete everything so you pay nothing further:

az group delete -n $RG --yes --no-wait

Cost note. A Standard_B1s VM plus a Standard public IP runs a few US cents per hour; finish and delete within the hour and the total is a rounding error (typically under ₹10 / ~US$0.10). The Network Watcher checks are effectively free. --no-wait returns immediately while Azure deletes in the background — confirm the resource group is gone afterwards so nothing lingers on the bill.

Common mistakes & troubleshooting

The meta-mistakes — the errors people make while troubleshooting — cost more than any single misconfiguration:

Mistake | Why it bites | Do this instead
Changing several settings at once | You can’t tell what fixed it (or what broke worse) | Change one variable, test, then the next
Fixing the symptom, not the cause | The incident recurs tomorrow | Trace to root cause; capture a prevention (step 8)
Trusting the intended config over the effective config | NSGs/routes merge; what you set ≠ what’s applied | Read effective rules/routes, sign-in logs, real DNS
Confusing authentication with authorization | You add roles when sign-in is the block (or vice versa) | Sign-in logs for auth; Check access for RBAC
Confusing control plane with data plane (storage) | Reader won’t read a blob; you grant the wrong role | Use Storage Blob Data roles for data ops
Forgetting RBAC propagation / token caching | “I assigned the role and it still fails” | Wait a few minutes; sign out/in for a fresh token
“Stopping” a VM in-guest and still being billed | Compute stays allocated | Deallocate (portal Stop / az vm deallocate) to stop compute charges
Skipping “what changed?” | You debug from scratch when a deploy caused it | Check the Activity Log first
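For the last row, “what changed?” can be asked of the Activity Log directly. A sketch: recent_changes is a hypothetical wrapper, and the two-hour default window is an arbitrary choice for illustration.

```shell
# recent_changes RG [OFFSET]: control-plane operations on a resource group
# over the last OFFSET (for example 2h or 1d), with who made each call.
recent_changes() {
  az monitor activity-log list -g "$1" --offset "${2:-2h}" \
    --query "[].{time:eventTimestamp, operation:operationName.value, status:status.value, caller:caller}" \
    -o table
}

# Example: recent_changes rg-ts-lab 6h
```

In the lab scenario this would be expected to surface the write to the NSG's security rules, along with the caller who made it.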

Best practices

Security notes

Troubleshooting under pressure is exactly when security hygiene erodes — guard against it:

Interview & exam questions

1. Walk me through how you troubleshoot an Azure issue. The loop: reproduce → isolate the layer → compare config vs desired → inspect logs/metrics → hypothesise and test (one variable) → fix the root cause → verify by re-running the reproduction → prevent. Emphasise isolating the layer and read-only first.

2. A VM is unreachable on port 443. First move? Network Watcher IP flow verify for inbound TCP/443 — it returns Allow/Deny and the deciding rule. If Allow, the NSG is exonerated; check next hop (a UDR to a dead NVA?), the in-guest service/firewall, and DNS. Don’t touch the VM until evidence points there.

3. Distinguish authentication from authorization failures, and the tool for each. Authentication = can you sign in (Entra, often Conditional Access) — use sign-in logs. Authorization = are you allowed this action (RBAC, the AuthorizationFailed/403) — use Check access. Conflating them is the classic error.

4. A user with Reader on a storage account gets 403 reading a blob. Why? Reader is a control-plane role — it shows the account exists but grants no data access. Reading blobs needs a data-plane role like Storage Blob Data Reader/Contributor; the two are separate authorization systems.

5. Name the four gates a storage request passes, in order. Firewall (network allowed?) → authorization (key/SAS/RBAC valid and sufficient?) → request validity (SAS expiry/perms, clock skew) → DNS (private-endpoint resolution). Any one returns 403; diagnose in that order.

6. How do you log into a VM with no working network and no RDP/SSH? The serial console — a keyboard into the VM’s serial port via the platform, independent of NSGs/network — backed by boot diagnostics to see the boot state. For external changes, run-command runs a script with admin/root rights over the control plane.

7. Stopping vs deallocating a VM — the difference, and why it matters? Stopping in-guest leaves the VM allocated — you keep paying for compute. Deallocating (portal Stop / az vm deallocate) releases the compute charge (disks still billed). A “stopped” VM still on the bill, or a deallocated VM that won’t restart due to capacity, are common surprises.

8. App Service returns 503. How do you approach it? 503 usually means the app failed to start, is stopped, or is cold/throttled. First: Log stream to read the actual startup error; confirm it’s Running; check Resource Health and Application Insights failures. Fixes: correct the startup error, enable Always On, scale, add the missing app setting.

9. You assigned an RBAC role but the user still can’t access. Three reasons. (a) Wrong scope; (b) propagation delay / cached token with old claims (sign out/in); (c) a Deny assignment or Azure Policy overrides the Allow. Check access reveals which.

10. A private-endpoint resource is reached on its public IP — what’s wrong, and how do you confirm? The Private DNS zone (e.g. privatelink.blob.core.windows.net) isn’t linked to the VNet or lacks the A record, so the name resolves to the public IP. Confirm with nslookup of the FQDN — it should return a private IP. Fix by linking the zone and adding the record.

11. How do you safely test a Conditional Access change you suspect is blocking sign-ins? Report-only mode: the policy is evaluated and logged in the sign-in logs (you see whether it would grant/block/require MFA) without enforcing it — so you validate before turning it on.

12. What turns a junior troubleshooter into a senior one? The last step: prevention. Juniors fix the symptom; seniors trace the root cause and leave behind an alert, a policy, infrastructure-as-code, or a runbook so it can’t silently recur — and they reason by isolating the layer instead of guessing.

Quick check

  1. In the eight-step method, which step is the “master move” that saves the most time, and what does it determine?
  2. A storage request from an identity you know has the right role still returns 403. Name two layers that could cause this and how you’d tell them apart.
  3. You can’t SSH to a VM. Which single Network Watcher check tells you whether an NSG is responsible — and what extra information does it give you?
  4. A user “is an Owner” but gets AuthorizationFailed. Give two plausible explanations.
  5. True or false: stopping a VM from inside its operating system stops you being billed for compute. Explain.

Answers

  1. Isolate the layer (step 2). It determines which layer is failing — identity, network, DNS, resource, or app — so you fix the right thing instead of the first thing.
  2. The firewall (caller’s network not allowed) and authorization (credential/role insufficient — e.g. a control-plane role used for a data-plane op). Tell them apart via Networking on the account (is the caller’s IP/VNet allowed?) versus Check access for the right Storage Blob Data role; a private-endpoint DNS miss is a third possibility.
  3. IP flow verify — it returns Allow/Deny for the exact 5-tuple and names the deciding rule, confirming whether the NSG is the cause and which rule to change.
  4. Owner at a different scope (another resource group); role is eligible via PIM, not activated; or a cached token predates the assignment (sign out/in). A Deny assignment/Policy could also override it.
  5. False. Stopping in-guest leaves the VM allocated — compute charges continue. You must deallocate (portal “Stop” or az vm deallocate) to release the compute charge (disks remain billed).

Exercise

You’re handed this incident cold: “Our internal API on api.internal.contoso.com, hosted on a VM behind a private endpoint, started returning errors at 09:10. Users get ‘connection timed out’. The app team swears nothing changed.”

Work it with the method and write down, for each step, what you would do and why:

  1. Reproduce — what exactly do you try, and from where (inside the VNet vs outside)?
  2. Isolate the layer — walk the five layers; for this symptom (“connection timed out” to a private-endpoint name), which layers are most likely and which can you quickly rule out?
  3. Config vs desired / what changed — where do you look to test “nothing changed”?
  4. Inspect — name the two or three read-only diagnostics you’d run (hint: one resolves the name, one checks the route/NSG) and what each result would tell you.
  5. Hypothesise & test, fix, verify, prevent — state your single most likely hypothesis, the one test that confirms it, the fix, how you’d verify from the user’s perspective, and the prevention you’d leave behind.

A strong answer recognises that “timed out” to a private name points hardest at DNS (resolving to the wrong/public IP — nslookup, expect a private IP, fix the Private DNS zone link) or the network (NSG/UDR — check effective rules and next hop); that the Activity Log is where you verify “nothing changed”; and that the prevention is codifying the Private DNS zone link plus an alert on changes to it.
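The “expect a private IP” test in that answer can be made mechanical. A rough sketch: is_private_ip is a hypothetical helper that recognises only the RFC 1918 IPv4 ranges, which is an assumption about how the VNet is addressed; a VNet using other address space would need a different check.

```shell
# is_private_ip IPV4: succeed if the address is in 10.0.0.0/8,
# 172.16.0.0/12 or 192.168.0.0/16. Hypothetical helper.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                        return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *)                                     return 1 ;;
  esac
}

# Typical use from a machine inside the VNet:
#   ip=$(getent ahostsv4 api.internal.contoso.com | awk 'NR==1 {print $1}')
#   is_private_ip "$ip" && echo "private endpoint path" || echo "public IP: check the Private DNS zone link"
```

Run it from inside the VNet, because the same name is expected to resolve to a public address from outside.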

Certification mapping

This lesson maps to AZ-104: Microsoft Azure Administrator, chiefly the Monitor and maintain Azure resources domain:

The companion diagnostics toolkit lesson drills the specific tools the exam expects you to name. The method here is also what AZ-305 and real architecture interviews probe when they ask how you’d approach an unfamiliar failure.

Glossary

Next steps

You now have a method that works on any Azure failure and playbooks for the domains that break most often. The natural next move is to go deep on the tools the playbooks reference and learn to wield them fluently:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
