Troubleshooting Azure App Service: 502/503 Errors, Cold Starts & Restart Loops

At 02:14 your phone buzzes: the public site is throwing 502 Bad Gateway. You hit refresh — sometimes it loads, sometimes it doesn’t. The deployment went out six hours ago and was fine. Nothing changed. Except something always changed. This is the most common production incident on Azure App Service — the platform-as-a-service that runs your web apps, APIs and web jobs on a managed fleet of Windows or Linux workers behind a shared front end — and it is maddening because the status code you see (502, 503) is reported by the front-end load balancer, not by your app. The front end is saying “I couldn’t get a good answer from your worker.” Why it couldn’t is the entire game, and at least a dozen distinct root causes hide behind those three digits.

This is the diagnostic playbook. We treat 502, 503, cold starts and restart loops not as four bugs but as four symptom classes, each with a fan-out of root causes you confirm with specific commands. You will learn to read the request pipeline — client → Front Door/App Gateway (optional) → App Service front end → worker (process or container) → outbound SNAT — and to localise a failure to exactly one hop, using the tools that tell the truth: az webapp log tail, the Kudu/SCM console, Diagnose and solve problems, Application Insights Failures + Live Metrics, the health check blade, and container logs. Every diagnosis comes with the exact path to confirm it and the precise fix, with both az CLI and Bicep (and KQL where the answer lives in logs). Because this is a reference you will return to mid-incident, the playbook itself, the error codes, the app settings and the plan tiers are all laid out as scannable tables — read the prose once, then keep the tables open at 02:14.

By the end you will stop guessing. When the pager goes off you will know whether you face a crashed worker, a container that never bound to the assigned port, a plan that ran out of instances, a Key Vault reference that failed at boot, SNAT port exhaustion from your own outbound calls, or simply a cold start Always On would have prevented. Knowing which within ninety seconds is what separates a five-minute incident from a two-hour one.

What problem this solves

App Service hides enormous machinery so you can git push and have a running web app. That abstraction is a gift until it breaks, then it becomes an opaque wall. The bare 502/503 HTML page deliberately tells you almost nothing — exposing internals to an anonymous caller would be a security leak. So the information you need is real and captured, but it lives in five or six different places, and if you don’t know which place maps to which failure you burn an hour clicking through blades.

What breaks without this knowledge: an on-call engineer restarts the app (which sometimes “fixes” it by accident, teaching the wrong lesson), scales up the plan (masking SNAT exhaustion for a day before it returns worse), or opens a support ticket and waits. Meanwhile the actual cause — a container listening on 3000 while App Service probes 8080, a deployment slot never warmed, or a health-check path that returns 500 because a downstream is down — sits there, perfectly diagnosable, ignored.

Who hits this: most teams running PaaS web apps, APIs or containers. It bites hardest on Linux container apps (the WEBSITES_PORT problem is near-universal for first-time deployers), apps with chatty outbound HTTP (SNAT exhaustion), cost-sensitive deployments without Always On (cold starts), and anyone using Key Vault references in app settings (boot-time failures that look like random restart loops). The fix is almost never “scale up” — it’s “find the hop that’s lying and make it tell the truth.”

To frame the whole field before the deep dive, here is every symptom class this article covers, the question it forces, and the one place to look first:

| Symptom class | What the front end is saying | First question to ask | First place to look | Most common single cause |
|---|---|---|---|---|
| 502 Bad Gateway | “I reached a worker but got a bad/no answer” | Did App Service even see the request fail? | az webapp log tail + App Insights Failures | Container not on WEBSITES_PORT, or upstream timeout |
| 503 Service Unavailable | “I had no healthy worker to ask” | Is the plan out of capacity, or are instances being evicted? | Diagnose and solve → Application Restarts | Restart in progress on a single instance |
| Cold start (slow first request) | (not an error, just latency) | Was the worker idle/just-deployed/just-swapped? | App Insights request duration after gaps | Always On off → idle unload after ~20 min |
| Restart loop (flapping 502/503) | “the worker keeps dying or never goes healthy” | Same exception every boot, or memory-pinned? | az webapp log tail (repeating trace) | Bad app setting or failed Key Vault reference |
| SNAT exhaustion | “outbound is failing under load” (shows as 502/timeouts) | Does it pass at rest and fail under load? | Diagnose and solve → SNAT Port Exhaustion | New HttpClient/socket per request |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the App Service basics: an App Service plan is the set of VM workers (an SKU like B1, P1v3) you rent, and one or more web apps run on that plan, sharing its CPU, memory and instance count. You should know how to run az in Cloud Shell, read JSON output, and that App Service has deployment slots (staging/production swap targets). Familiarity with HTTP status codes and basic Linux/Windows process concepts helps.

This sits in the Observability & Troubleshooting track. It assumes the compute fundamentals (the Azure App Service vs Container Apps vs AKS decision is upstream of it) and the platform mechanics from the Azure App Service Deep Dive: Plans, Scaling, Slots, TLS. It pairs tightly with Azure Monitor & Application Insights for observability, because Application Insights is the single most useful tool in this entire playbook. If you run App Service behind a front end, Application Gateway with WAF is the layer where some of these timeouts originate.

A quick map of who confirms what during an incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Client / DNS | TLS, name resolution, retries | Frontend / SRE | 502/503 only if misrouted; mostly red herrings |
| Front Door / App Gateway | WAF, backend timeout, probes | Network team | 502 (upstream timeout), 403 (WAF/IP rules) |
| App Service front end (ARR) | Worker selection, port probe | Microsoft (platform) | 502 (no good answer), 503 (no worker) |
| Worker (process / container) | Your code, runtime, port bind | App / dev team | 502 (crash, wrong port), restart loop |
| App settings / Key Vault | Config, secrets, identity | App + platform | Restart loop (bad setting / KV ref) |
| Outbound (SNAT / NAT GW) | Egress to DB / APIs | Platform + network | 502/timeouts under load (SNAT) |

Core concepts

Five mental models make every later diagnosis obvious.

The status code names the front end’s complaint, not your bug. Every request goes through a front-end role (a shared layer running ARR — Application Request Routing) that picks a worker and proxies to it. A 502 Bad Gateway means the front end reached a worker but got a broken/no response (connection refused, reset, HTTP-violating, or a timeout waiting). A 503 Service Unavailable means it could not get a healthy worker at all — none available, app recycling, platform restarting, or a quota blocked it. “Bad answer from the worker” (502) versus “no worker to ask” (503) is the first fork in every decision tree.

Your worker is a process the platform babysits. On Windows your app runs under w3wp.exe; on Linux/containers it’s a process (often a Docker container the platform pulls and starts). The platform recycles (kills and restarts) it on triggers: crash, config change, deployment, failed health check, exceeding the startup time limit, or memory pressure. A restart loop is this recycle firing repeatedly because the app dies or fails to become healthy faster than it can stay up — the platform doing exactly what you told it, against an app that can’t stay alive.

The port contract is explicit and unforgiving. App Service tells your app which TCP port to listen on. Windows uses the injected HTTP_PLATFORM_PORT/ASPNETCORE_URLS; Linux built-in stacks honour the PORT env var; Linux custom containers must declare their port via the WEBSITES_PORT app setting (default probe is port 80). If your container listens on 3000 and you never set WEBSITES_PORT=3000, the front end probes 80, gets connection-refused, and returns 502 forever — even though your container is healthy. This is the number-one container failure and has nothing to do with your code.

Allocation is finite and shared. A plan has a fixed instance count and a per-instance memory ceiling tied to the SKU, shared by every app on it. SNAT (Source Network Address Translation) ports — the pool mapping your outbound connections to a shared public IP — are also finite (roughly 128 pre-allocated per instance, expandable but bounded). Burn through any of these and you get 5xx that looks like app bugs but is resource exhaustion; the CPU/memory/SNAT metrics tell you which ceiling you hit.

Cold start is latency, not an error. A worker with no warm process — idle and unloaded (no Always On), just deployed, scaled out to a new instance, or swapped in without warm-up — makes the first request pay for process start: runtime boot, JIT compilation (.NET/JVM), DI container build, connection-pool prime, and for containers an image pull and container start. That can take 10–60+ seconds. It’s not a 502 unless it exceeds a timeout; it’s a slow request, fixed by ensuring a warm worker always exists.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to 502/503 |
|---|---|---|---|
| App Service plan | The rented VM workers (SKU + count) | Subscription / resource group | Capacity ceiling; over-commit → 503 |
| Web app (site) | One app running on a plan | On the plan | The thing that crashes / flaps |
| Front end (ARR) | Shared layer that proxies to a worker | Microsoft-managed | Emits 502/503 codes |
| Worker | The process/container serving requests | On a plan instance | Crashes, wrong port, OOM |
| WEBSITES_PORT | Port your container listens on | App setting | Wrong/unset → 502 forever |
| SNAT port | Outbound connection → shared IP mapping | Per instance (~128) | Exhaustion → outbound 502/timeouts |
| Always On | Keeps a warm worker resident | Site config (B1+) | Off → cold-start latency |
| Health check | Path probed per instance | Site config | Bad path evicts all → 503 |
| Deployment slot | A swappable copy of the app | On the plan | Cold/no-warm-up swap → 502 |
| Key Vault reference | App setting @Microsoft.KeyVault(...) | App setting + identity | Fails to resolve → crash loop |
| Recycle | Platform kills + restarts the worker | Platform behaviour | Repeated → restart loop |
| Cold start | First-request latency on a fresh worker | Worker lifecycle | Slow first request; can trip timeouts |

The HTTP status-code reference

Before the per-symptom anatomy, here is the lookup table you scan first: every status code you realistically see from an App Service app, what it actually means on this platform, the likely cause, how to confirm it, and the fix. The non-obvious ones are the ANCM 500.3x codes (the .NET Windows in-process hosting failures) and the difference between a platform 503 and an app-emitted 503.

| Code | Meaning | Likely cause on App Service | How to confirm | First fix |
|---|---|---|---|---|
| 502.3 Bad Gateway | Front end got no/broken answer from worker | Worker crashed, wrong WEBSITES_PORT, bound 127.0.0.1, upstream timeout, SNAT exhaustion | az webapp log tail; default_docker.log port-probe line | Match the symptom in the playbook below |
| 503 Service Unavailable | No healthy worker available | Restart in progress, plan over-commit, quota, all instances evicted | Diagnose and solve → Application Restarts | Run ≥2 instances; check quota/health |
| 500.30 ANCM In-Process Start Failure | .NET app failed to start in-process (Windows) | Unhandled startup exception, bad config, missing dependency | eventlog.xml in Kudu; stdout log | Fix startup code/config; enable stdout logging |
| 500.31 ANCM Failed to Find/Load Runtime | Required .NET runtime missing/mismatched | Wrong target framework vs installed runtime | dotnet --info / publish settings; eventlog.xml | Self-contained deploy or match runtime version |
| 500.32 ANCM Failed to Load dll | Wrong-bitness or missing native dll | x86/x64 mismatch, missing native dep | eventlog.xml ANCM detail | Match platform bitness; include native deps |
| 500.37 ANCM Failed to Start Within Startup Time Limit | App didn’t start before the ANCM timeout | Slow startup (migrations, blocking init) | eventlog.xml; startup duration | Speed up startup; raise startup limit; fail soft |
| 500.0 / 500 Internal Server Error | App threw at runtime | Unhandled exception in a request | App Insights Failures (exceptions) | Fix the throwing code path |
| 504 Gateway Timeout | Upstream waited too long for the worker | Slow backend > Front Door/App GW timeout | App Insights request duration vs timeout | Speed up backend; raise upstream timeout |
| 403 Forbidden (ip-restriction) | Access restriction / private access blocked the caller | IP rules, private endpoint, SCM lockdown | Access restrictions blade; accessRestrictions | Add the caller’s IP/range or fix routing |
| 403 SSL required / cert | HTTPS-only or client-cert rule | httpsOnly, clientCertEnabled | Site config; request scheme | Use HTTPS; present required client cert |
| 404 on a deployed app | Wrong start path, missing default doc, run-from-package issue | Bad wwwroot, wrong virtual app path | Kudu /home/site/wwwroot listing | Fix path / default document / package mount |
| 409 Conflict (deploy) | Concurrent deploy/swap or locked files | Overlapping ZipDeploy, file in use | Activity log; deployment center | Serialise deploys; use run-from-package |

Three reading notes that save the most time:

| Distinction | The trap | How to tell them apart |
|---|---|---|
| Platform 503 vs app-emitted 503 | Your app may itself return 503 (e.g. a maintenance handler) | App Insights requests shows a request that ran and returned 503 → it’s yours; a platform 503 has no matching request row |
| 502 from App Service vs from the gateway | Hours wasted in the wrong logs | If App Insights shows the request succeeding (slowly) but the client got 502, the gateway emitted it |
| 500.3x (ANCM) vs generic 500 | ANCM = startup, generic 500 = runtime | 500.30/31/32/37 mean the worker never started; fix config/runtime, not a request handler |
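
The first two rows of that table amount to a small decision rule, which can be sketched in Python. This is a sketch under stated assumptions, not a platform API: locate_emitter and its return labels are hypothetical names, and it assumes you have already looked up the matching row (if any) in the App Insights requests table.

```python
from typing import Optional


def locate_emitter(client_status: int, app_insights_status: Optional[int]) -> str:
    """Guess which hop produced the status the caller saw.

    client_status: the code the caller received (502 or 503 here).
    app_insights_status: result code of the matching row in the App Insights
    requests table, or None when no row exists for that request.
    """
    if app_insights_status is None:
        # The request never ran in your code, so the platform answered for it.
        return "app-service-front-end"
    if client_status == 503 and app_insights_status == 503:
        # Your own handler returned it (for example a maintenance page).
        return "app"
    if client_status == 502 and 200 <= app_insights_status < 400:
        # The app finished, perhaps slowly, but an upstream layer gave up first.
        return "gateway"
    # Anything else needs a human to read the logs.
    return "unclear"
```

For example, locate_emitter(502, 200) returns "gateway", which tells you to stop reading App Service logs and open the gateway's.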

Anatomy of a 502 Bad Gateway

A 502 means the front end got a bad answer from a worker. Five distinct causes. Scan the matrix, then read the detail for whichever row matches:

| # | 502 cause | Tell-tale signal | Confirm with | Real fix | Band-aid that masks it |
|---|---|---|---|---|---|
| 1 | Worker crashed / threw at startup | Stack trace + recycle in logs | az webapp log tail; App Insights Failures | Fix startup code/config; fail soft | Restart (recurs in seconds) |
| 2 | Container not on WEBSITES_PORT | “didn’t respond to HTTP pings on port: 80” | default_docker.log | Set WEBSITES_PORT; bind 0.0.0.0 | None (it never works) |
| 3 | Startup exceeds time limit | Gap > 230 s between “starting” and “failing” | default_docker.log timestamps | Shrink image; same-region ACR; raise limit | Raise limit only (still slow) |
| 4 | Upstream timeout (Front Door/App GW) | App Service logs show success, client got 502 | App Insights duration vs gateway timeout | Speed up backend; raise upstream timeout | Raise timeout only |
| 5 | SNAT port exhaustion | Fails under load, fine at rest | SNAT detector; SnatConnectionCount Failed | Reuse connections; NAT Gateway | Scale out (+128 ports/instance) |

Cause 1 — The worker process crashed or threw at startup

Your code throws an unhandled exception at startup (bad connection string, missing env var, failed boot migration) or crashes under a specific request. The worker dies, the front end has nothing to proxy, you get 502 (and 503 while it recycles).

Confirm. Stream live logs and watch for the stack trace and recycle:

# Tail live application + platform logs (Ctrl-C to stop)
az webapp log tail --name app-shop-prod --resource-group rg-shop-prod

Then pull the Application Insights Failures view — it groups exceptions by type and shows the failing operation:

// Top server exceptions in the last hour, with the operation that threw
exceptions
| where timestamp > ago(1h)
| summarize count() by problemId, outerMessage, operation_Name
| order by count_ desc

In the portal: Diagnose and solve problems → Availability and Performance → Web App Down / Application Crashes surfaces the same crash with the worker exit.

Fix. Fix the throwing code — for boot crashes, usually a misconfigured app setting (see Key Vault references below) or a dependency unreachable at startup. Make startup resilient: don’t run blocking migrations or hard-dependency checks synchronously in startup; fail soft and surface readiness via the health check instead.

Cause 2 — Container not listening on the assigned port (WEBSITES_PORT)

The classic. Your custom Linux container listens on 8000 or 3000, but App Service probes 80 by default. Connection refused → 502. The container logs show your app started fine — that’s what makes it confusing.

Confirm. Pull the container/startup logs and find the platform’s port-probe failure:

# Download the Docker/container startup log (Linux custom container)
az webapp log download --name app-api-prod --resource-group rg-shop-prod \
  --log-file logs.zip
# Inside, default_docker.log shows lines like:
#   "Container ... didn't respond to HTTP pings on port: 80, failing site start"
#   "Stopping site ... because it failed during startup."

That “didn’t respond to HTTP pings on port: 80” line is the dead giveaway.
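
If you have the downloaded log on disk, a few lines of Python can pull the probed port out of it and compare it with the port your container actually binds. A minimal sketch: probed_ports and port_mismatch are hypothetical helper names, and the pattern matches only the wording quoted above, so treat it as an assumption if your log phrasing differs.

```python
import re

# Matches the platform's probe-failure wording quoted above; accepts a
# straight or a curly apostrophe in "didn't".
PROBE_FAILURE = re.compile(r"didn['\u2019]t respond to HTTP pings on port: (\d+)")


def probed_ports(log_text: str) -> list:
    """Return the distinct ports the platform reported probing, sorted."""
    return sorted({int(m.group(1)) for m in PROBE_FAILURE.finditer(log_text)})


def port_mismatch(log_text: str, app_port: int) -> bool:
    """True when a probe failure was logged for a port other than app_port."""
    ports = probed_ports(log_text)
    return bool(ports) and app_port not in ports
```

Feed it the contents of default_docker.log and the port from your Dockerfile or start command; a True result points at WEBSITES_PORT rather than at your code.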

Fix. Set WEBSITES_PORT to the port your container actually binds:

az webapp config appsettings set --name app-api-prod --resource-group rg-shop-prod \
  --settings WEBSITES_PORT=8000

// Bicep equivalent (assumes `plan` and `location` are declared elsewhere in the template)
resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'app-api-prod'
  location: location
  properties: {
    serverFarmId: plan.id
    siteConfig: {
      linuxFxVersion: 'DOCKER|myregistry.azurecr.io/api:1.4.2'
      appSettings: [
        { name: 'WEBSITES_PORT', value: '8000' }
        // Make the container bind 0.0.0.0:8000, NOT 127.0.0.1 — see gotcha
      ]
    }
  }
}

The deeper gotcha: your app must bind 0.0.0.0 (all interfaces), not 127.0.0.1/localhost. A container that binds only loopback rejects the platform’s probe from outside the container even when the port is right. Here is how the port contract differs by stack — knowing your row removes the guesswork:

| Hosting stack | How the port is communicated | Default the platform probes | What you set | Bind address required |
|---|---|---|---|---|
| Windows .NET (in-process) | HTTP_PLATFORM_PORT injected → ANCM | Injected port | Nothing (ANCM handles it) | Loopback (ANCM proxies) |
| Windows .NET (out-of-process) | ASPNETCORE_URLS / HTTP_PLATFORM_PORT | Injected port | Honour ASPNETCORE_URLS | localhost (ANCM proxies) |
| Linux built-in (Node/Python/Java/.NET) | PORT env var | The PORT value (often 8080) | Read PORT; listen on it | 0.0.0.0 |
| Linux custom container | WEBSITES_PORT app setting | 80 if unset | WEBSITES_PORT=<real port> | 0.0.0.0 (mandatory) |
| Windows custom container | WEBSITES_PORT app setting | 80 | WEBSITES_PORT=<real port> | 0.0.0.0 |
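
For the Linux rows, the contract comes down to two decisions in code: read the port from the environment, and bind all interfaces. A minimal Python sketch, assuming app settings reach the process as environment variables; resolve_bind, make_server, the Ping handler and the 8080 fallback are illustrative choices, not platform APIs.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def resolve_bind(env=os.environ, default_port=8080):
    """Return the (host, port) pair the platform's probe can reach."""
    # PORT is what the built-in Linux stacks receive; WEBSITES_PORT is the app
    # setting you declare for a custom container. The fallback is an assumption.
    port = int(env.get("PORT") or env.get("WEBSITES_PORT") or default_port)
    # 0.0.0.0 means all interfaces. Binding 127.0.0.1 would refuse the probe.
    return ("0.0.0.0", port)


class Ping(BaseHTTPRequestHandler):
    """Answers every GET with a tiny 200 so the port probe succeeds."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(env=os.environ):
    """Build (but do not start) a server bound per the port contract."""
    return ThreadingHTTPServer(resolve_bind(env), Ping)

# As a container entry point you would call:
#   make_server().serve_forever()
```

Keeping the port lookup in one small function makes the contract testable without starting a server, which is handy in a pipeline check.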

Cause 3 — Startup exceeds the container start time limit

A heavy container (large runtime, slow init, image pull on a cold instance) takes longer to become responsive than the startup ceiling. Default is 230 seconds; the platform gives up, fails the start, and you get 502/503 on a flapping site.

Confirm. default_docker.log shows the start abandoned after the limit — timestamps between “Starting container” and “failing site start” exceed it.
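
The timestamp comparison is simple arithmetic, sketched here in Python. The helper names are hypothetical, and the ISO 8601 strings are an assumption: copy the two timestamps out of the log by hand and normalise them to that form, since the exact log prefix format is not specified here.

```python
from datetime import datetime

DEFAULT_LIMIT_S = 230  # default container start limit, per the text above
MAX_LIMIT_S = 1800     # highest value the setting accepts, per the text above


def startup_gap_seconds(started_at: str, failed_at: str) -> float:
    """Seconds between the start line and the failure line (ISO 8601 inputs)."""
    start = datetime.fromisoformat(started_at)
    fail = datetime.fromisoformat(failed_at)
    return (fail - start).total_seconds()


def hit_start_limit(started_at: str, failed_at: str,
                    limit_s: float = DEFAULT_LIMIT_S) -> bool:
    """True when the start was abandoned at or beyond the configured limit."""
    return startup_gap_seconds(started_at, failed_at) >= limit_s
```

A gap that lands right at the limit means the platform gave up; a much shorter gap means the container died on its own, which is cause 1, not cause 3.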

Fix. Raise the limit (max 1800 seconds), but treat a long start as a smell:

az webapp config appsettings set --name app-heavy-prod --resource-group rg-shop-prod \
  --settings WEBSITES_CONTAINER_START_TIME_LIMIT=600

// Bicep equivalent (fragment; goes inside siteConfig)
appSettings: [
  { name: 'WEBSITES_CONTAINER_START_TIME_LIMIT', value: '600' } // seconds, max 1800
]

Better fixes: shrink the image, move heavy init out of the critical path, pull from an ACR in the same region as the app, and turn on Always On so the slow start happens once at deploy, not on every idle wake. What actually eats the startup budget, and what to do about each:

| Startup cost | Typical magnitude | Reduce it by | Trade-off |
|---|---|---|---|
| Image pull (custom container) | 5–90 s for a 0.5–2 GB image | Smaller base image, same-region ACR, layer caching | Build discipline; multi-stage Dockerfiles |
| Runtime boot (.NET/JVM/Node) | 1–10 s | ReadyToRun / AOT, tiered JIT, trimming | Larger artifacts; build complexity |
| DI graph + first DB connect | 1–15 s | Lazy init, async warm-up, pooled drivers | First real request still primes pools |
| Key Vault reference resolution | 0.2–3 s per secret | Fewer references; cache; reference App Config | Slightly less granular secret rotation |
| Migrations / schema checks at boot | seconds to minutes | Move out of startup; run in pipeline | Need a migration gate in CI/CD |

Cause 4 — Upstream timeout from Front Door or Application Gateway

When App Service sits behind Front Door or Application Gateway, that layer can emit the 502. If your worker takes longer than the gateway’s backend timeout, the gateway times out the upstream and returns 502 to the client while App Service logs show the request succeeding (just slowly). People stare at App Service logs for an hour because the 502 isn’t there.

Confirm. App Insights shows the request completing in, say, 95 s; the gateway’s timeout is shorter. For Application Gateway, check the request-timeout setting:

# Application Gateway backend HTTP settings — check the request timeout (seconds)
az network application-gateway http-settings list \
  --gateway-name agw-shop --resource-group rg-shop-prod \
  --query "[].{name:name, timeout:requestTimeout, port:port, protocol:protocol}" -o table

For Front Door, the origin response timeout (default around 60 seconds, raisable up to 240 on Standard/Premium) is the equivalent. If App Service’s response time is climbing toward that number, Front Door will start cutting requests.

Fix. Make the app respond faster (the right fix), and/or raise the upstream timeout to match a legitimately long operation:

az network application-gateway http-settings update \
  --gateway-name agw-shop --resource-group rg-shop-prod \
  --name appservice-settings --timeout 120

Also verify the gateway probe targets a fast health endpoint (not /, which may itself be slow), or it marks healthy backends unhealthy and starts 502-ing. The timeouts that matter, where they live, and their defaults:

| Timeout | Layer | Default | Max | What hitting it looks like |
|---|---|---|---|---|
| Backend request timeout | Application Gateway (HTTP settings) | 20 s (newer) / 30 s | 86,400 s | 502 to client, App Service request succeeded |
| Origin response timeout | Front Door Standard/Premium | ~60 s | 240 s | 504/502 at the edge, origin still working |
| Idle connection timeout | App Service load balancer | ~230 s | configurable via setting | Long-poll/SignalR connections cut at ~4 min |
| Health-probe timeout | App Gateway / Front Door probe | seconds | configurable | Healthy backend marked unhealthy → 502 |
| Client (browser/SDK) timeout | Caller | varies | n/a | Client gives up; not an App Service issue |
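
To put a number on how exposed you are to an upstream timeout, compare the request durations App Insights recorded with the timeout the gateway enforces. A rough Python sketch; share_cut_by_gateway is a hypothetical helper, and it assumes you have exported the durations in seconds yourself.

```python
from typing import Iterable


def share_cut_by_gateway(durations_s: Iterable[float], gateway_timeout_s: float) -> float:
    """Fraction of requests that ran longer than the gateway is willing to wait.

    Those requests succeed in App Service logs but reach the client as 502/504.
    """
    durations = list(durations_s)
    if not durations:
        return 0.0
    cut = sum(1 for d in durations if d > gateway_timeout_s)
    return cut / len(durations)
```

With durations of 95, 12, 3 and 70 seconds against a 60-second timeout, half the requests would be cut at the gateway even though every one of them completed on the worker.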

Cause 5 — SNAT port exhaustion from the app’s own outbound calls

The cruel one. Your app makes outbound HTTP calls (database, third-party API, another microservice) and — through a bug like a new HttpClient per request, or no connection reuse — opens thousands of outbound TCP connections. App Service maps each to a SNAT port from a finite pool (about 128 pre-allocated per instance, with bounded on-demand expansion). Exhaust it and new outbound connections fail — surfacing as intermittent 5xx, dependency timeouts, and 502s, under load not at rest, which is why it passes in test and dies in production.

Confirm. In Diagnose and solve problems → SNAT Port Exhaustion, the tile shows allocated vs failed SNAT connections. Via metrics:

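# SnatConnectionCount with the 'Failed' dimension: any non-zero Failed is the smoking gun.
# If the metric is not offered for your resource, `az monitor metrics list-definitions --resource <id>`
# lists what is; in that case fall back to the detector above.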
az monitor metrics list \
  --resource $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --metric SnatConnectionCount \
  --interval PT1M --aggregation Total

In App Insights, dependency calls to the same host spiking in failures under load corroborates it:

dependencies
| where timestamp > ago(1h) and success == false
| summarize failed=count() by target, type
| order by failed desc

Fix. The real fix is in code: reuse connections — a single shared HttpClient/IHttpClientFactory, pooled DB drivers, Keep-Alive. Architecturally, attach a NAT Gateway to a VNet-integrated subnet (a far larger SNAT pool), or use Private Endpoints for Azure PaaS targets (traffic stays on the backbone, no SNAT). Scaling out adds 128 ports per instance — a band-aid, not a fix.

# VNet-integrate the app, then the platform/NAT Gateway handles outbound at scale
az webapp vnet-integration add --name app-shop-prod --resource-group rg-shop-prod \
  --vnet vnet-shop --subnet snet-appsvc-integration

The real numbers behind the SNAT pool, and what each mitigation buys you:

| Mechanism | SNAT ports available | Setup effort | Cost impact | Notes / limit |
|---|---|---|---|---|
| Default (no VNet integration) | ~128 pre-allocated per instance, bounded on-demand expansion | None | None | Shared platform IP; the constraint you usually hit |
| Scale out instances | +~128 per added instance | Slider / autoscale | Linear per-instance cost | Band-aid; masks a connection-reuse bug |
| Connection reuse (code) | Same ports, far fewer used | Code change | None | The actual fix; cuts outbound connections ~90%+ |
| VNet integration + NAT Gateway | Up to ~64,512 ports per attached public IP (×16 IPs) | Subnet + NAT GW | Small hourly + per-GB | Massive headroom; decouples from instance count |
| Private Endpoints (PaaS targets) | N/A (traffic bypasses SNAT) | Per target | Per endpoint hourly | DB/Storage/etc. stay on the backbone, no SNAT |

A worked sizing example: at 1,800 requests/second with a new HttpClient per request and a ~4-minute TCP TIME_WAIT, you can have hundreds of thousands of sockets in flight against a per-instance pool of ~128. That is why it fails instantly under flash-sale load and never in a unit test.
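
The arithmetic behind that example is just arrival rate multiplied by how long each socket stays occupied. A small Python sketch, with hypothetical function names, and the simplifying assumption that every request opens exactly one new outbound connection that is held for the full TIME_WAIT period:

```python
import math

PORTS_PER_INSTANCE = 128  # pre-allocated SNAT ports per instance, per the text above


def sockets_held(requests_per_s: float, hold_s: float) -> float:
    """Steady-state sockets occupied = arrival rate x seconds each one is held."""
    return requests_per_s * hold_s


def instances_to_cover(sockets: float, ports_per_instance: int = PORTS_PER_INSTANCE) -> int:
    """Instances you would need if scale-out were the only mitigation."""
    return math.ceil(sockets / ports_per_instance)


held = sockets_held(1800, 240)        # 1,800 req/s, ~4-minute hold
needed = instances_to_cover(held)     # against ~128 ports per instance
```

Here held comes to 432,000 sockets and needed to 3,375 instances, which is why scaling out is a band-aid and connection reuse is the fix.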

Anatomy of a 503 Service Unavailable

A 503 means the front end had no healthy worker to hand the request to. Five causes — scan, then read the matching detail:

| # | 503 cause | Tell-tale signal | Confirm with | Real fix |
|---|---|---|---|---|
| 1 | Platform restart / recycle in progress | Brief 503 on a single instance during patch/deploy | Diagnose and solve → Application Restarts | Run ≥2 instances; deploy via slot-swap |
| 2 | Plan over-commit / scale-out limit | Plan CPU/RAM pinned; instance count flat | Plan metrics; autoscale max | Spread apps; scale up SKU; raise max-count |
| 3 | Stray app_offline.htm | 503 for everyone after a deploy, redeploy doesn’t help | Kudu ls /home/site/wwwroot | Delete file; run-from-package |
| 4 | Free/Shared daily quota exceeded | Dev app 503s every afternoon, recovers at midnight | sku.tier = Free/Shared; quota detector | Move to B1+ |
| 5 | Health-check eviction of all instances | Whole app 503s when a downstream blips | Health check blade; /healthz KQL | Shallow health path; raise max-ping-failures |

Cause 1 — Platform restart or recycle in progress

The platform is patching/migrating the worker, or your app is recycling (deployment, config change, scale op). For the seconds the worker is down with no other instance to absorb traffic, you get 503. On a single-instance plan this is unavoidable downtime on every restart.

Confirm. Diagnose and solve problems → Application Restarts shows the events and their cause (Platform Initiated, User Initiated, Configuration Change). The activity log corroborates:

az monitor activity-log list --resource-group rg-shop-prod \
  --offset 6h --query "[?contains(operationName.value,'restart') || contains(operationName.value,'sites')].{time:eventTimestamp, op:operationName.value, status:status.value}" \
  -o table

Fix. Run at least two instances so a restart of one never zeroes out capacity, and use deployment slots with swap so config changes warm a staging instance before it takes traffic — a swap should be near-zero-downtime, an in-place restart is not. The restart triggers, who initiates them, and whether you can avoid the downtime:

| Restart trigger | Cause field in detector | Avoidable? | How to make it invisible |
|---|---|---|---|
| Platform patching / migration | Platform Initiated | No (platform-driven) | ≥2 instances so one stays up |
| App setting / config change | Configuration Change | Yes | Change in a slot, then swap |
| Manual az webapp restart | User Initiated | Yes | Avoid in-place; prefer slot-swap |
| Deployment | Deployment | Partly | Run-from-package + slot-swap with warm-up |
| Failed health check | Health Check | Yes | Honest health path; tune max-ping-failures |
| Scale-in (autoscale removing an instance) | Autoscale | Yes | Graceful shutdown handling; drain |

Cause 2 — Plan over-commit and scale-out limits

Too many apps on one plan, or an app starved of CPU/memory, means the front end can’t get a responsive worker → 503 under load. Or autoscale’s maximum instance count is too low (or you hit the SKU ceiling) and demand outruns supply.

Confirm. Look at plan-level CPU and memory and the instance count:

# Plan CPU% and memory% — sustained high = over-committed
PLAN_ID=$(az appservice plan show -n plan-shop-prod -g rg-shop-prod --query id -o tsv)
az monitor metrics list --resource "$PLAN_ID" \
  --metric CpuPercentage MemoryPercentage --interval PT1M --aggregation Average -o table

# Current autoscale rules and max instances
az monitor autoscale show --name autoscale-shop --resource-group rg-shop-prod \
  --query "{min:profiles[0].capacity.minimum, max:profiles[0].capacity.maximum}" -o json

Fix. Spread apps across more plans (don’t pack 30 apps onto one P1v3), scale up the SKU for more per-instance CPU/RAM, and scale out with a sane maximum. Raise the autoscale ceiling:

az monitor autoscale update --name autoscale-shop --resource-group rg-shop-prod \
  --max-count 10 --min-count 2

// Bicep equivalent (assumes `plan` and `location` are declared elsewhere in the template)
resource autoscale 'Microsoft.Insights/autoscalesettings@2022-10-01' = {
  name: 'autoscale-shop'
  location: location
  properties: {
    targetResourceUri: plan.id
    enabled: true
    profiles: [ {
      name: 'default'
      capacity: { minimum: '2', maximum: '10', default: '2' }
      rules: [ {
        metricTrigger: {
          metricName: 'CpuPercentage'
          metricResourceUri: plan.id
          timeGrain: 'PT1M'
          statistic: 'Average'
          timeWindow: 'PT5M'
          timeAggregation: 'Average'
          operator: 'GreaterThan'
          threshold: 70
        }
        scaleAction: { direction: 'Increase', type: 'ChangeCount', value: '1', cooldown: 'PT5M' }
      } ]
    } ]
  }
}

Cause 3 — app_offline.htm present

.NET deployments drop an app_offline.htm file into the site root to take the app offline during a deploy. If a deploy is interrupted or the file is left behind, the app stays “offline” and serves 503 to everyone. People redeploy three times and never look at the file.

Confirm. In the Kudu/SCM console (https://<app>.scm.azurewebsites.net → Debug console), list the site root:

# Browse to /home/site/wwwroot and look for the stray file
ls -la /home/site/wwwroot/app_offline.htm

Fix. Delete the stray app_offline.htm; fix the pipeline that left it (a failed ZipDeploy/partial publish). Prefer run-from-package (WEBSITE_RUN_FROM_PACKAGE=1) for atomic, immutable deploys that can’t leave partial-state files.

Cause 4 — Quota exceeded on Free / Shared tiers

F1 (Free) and D1 (Shared) plans have hard daily quotas — chiefly CPU minutes/day (F1 ≈ 60). Blow it and App Service stops the app for the rest of the day, serving 503 (“quota exceeded”) until the midnight-UTC reset. Dev apps mysteriously die every afternoon.

Confirm. Diagnose and solve reports quota status; or check the tier:

az appservice plan show -n plan-shop-dev -g rg-shop-dev \
  --query "{sku:sku.name, tier:sku.tier}" -o table
# Free/Shared → daily CPU/memory quotas apply, reset at 00:00 UTC

Fix. No production fix exists on Free/Shared — they’re for experiments. Move to B1+ (no daily CPU quota, gains Always On):

az appservice plan update --name plan-shop-dev --resource-group rg-shop-dev --sku B1

The Free/Shared quotas that stop your app, and when they reset:

| Quota | F1 (Free) | D1 (Shared) | Reset | Symptom when exceeded |
| --- | --- | --- | --- | --- |
| CPU minutes / day | ~60 min | ~240 min | 00:00 UTC | 503 “quota exceeded” until reset |
| Memory | ~1 GB shared, capped | ~1 GB shared, capped | rolling | App stopped on breach |
| Outbound data / day | ~165 MB | unlimited (fair use) | 00:00 UTC | Outbound blocked |
| Always On | Not available | Not available | n/a | Idle unload → cold starts |
| Scale-out | Not available | Not available | n/a | No HA; every restart is a 503 |
| Custom domain TLS | Not available | Not available | n/a | No production TLS |
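
The afternoon failures follow from simple arithmetic on the CPU-minute budget. The sketch below is a simplification, assuming the quota is consumed in proportion to a steady average utilisation of one core; the function name and the 25% load figure are illustrative, not platform values.

```python
def hours_until_quota(cpu_minutes_per_day: float, avg_cpu_fraction: float) -> float:
    """Wall-clock hours until a daily CPU-minute quota is spent.

    Assumes (simplification) steady utilisation of one core at avg_cpu_fraction.
    """
    if not 0 < avg_cpu_fraction <= 1:
        raise ValueError("avg_cpu_fraction must be in (0, 1]")
    wall_minutes = cpu_minutes_per_day / avg_cpu_fraction
    return wall_minutes / 60


# F1 (~60 CPU min/day) at a steady 25% CPU is spent 4 hours after it starts working
print(hours_until_quota(60, 0.25))   # 4.0
# D1 (~240 CPU min/day) under the same load lasts 16 hours
print(hours_until_quota(240, 0.25))  # 16.0
```

A dev app that only gets busy when the team starts work can therefore run out mid-afternoon and stay down until the 00:00 UTC reset.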

Cause 5 — Health-check eviction of unhealthy instances

The Health check path you configure is probed on every instance; one that fails for the configured window is removed from rotation and later replaced. That is the feature working as intended. But if all instances fail the probe (because the health path depends on a downed database, or returns 500), every instance is evicted and the front end has nothing healthy → 503 across the board. A too-strict health check takes your whole app offline.

Confirm. The Health check blade shows per-instance status; App Insights shows the path returning non-2xx:

requests
| where timestamp > ago(30m) and url endswith "/healthz"
| summarize total=count(), failures=countif(success == false) by bin(timestamp, 1m), cloud_RoleInstance
| order by timestamp desc

Fix. Make the health path shallow and honest: return 200 if this instance can serve, and don’t hard-fail on a briefly-unavailable optional downstream (or you evict every instance at once). Separate liveness from readiness. Configure the check:

az webapp config set --name app-shop-prod --resource-group rg-shop-prod \
  --generic-configurations '{"healthCheckPath": "/healthz"}'

resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'app-shop-prod'
  location: location
  properties: {
    serverFarmId: plan.id
    siteConfig: {
      healthCheckPath: '/healthz'
      // App setting: consecutive failed pings before an instance is marked unhealthy and removed from rotation
      appSettings: [
        { name: 'WEBSITE_HEALTHCHECK_MAXPINGFAILURES', value: '10' }
      ]
    }
  }
}

WEBSITE_HEALTHCHECK_MAXPINGFAILURES (valid range 2–10) controls how many consecutive failed pings mark an instance unhealthy and pull it from rotation; raise it if transient blips are causing premature eviction. Here is the complete health-check knob set and how to reason about each:

| Setting / control | What it does | Default | Valid range / values | When to change |
| --- | --- | --- | --- | --- |
| healthCheckPath | Path probed per instance | unset (disabled) | any path returning 200 when healthy | Always set in prod; keep it shallow |
| WEBSITE_HEALTHCHECK_MAXPINGFAILURES | Consecutive failed pings before an instance is removed from rotation | 10 | 2–10 | Lower for fast eviction; higher to ride blips |
| WEBSITE_HEALTHCHECK_MAXUNHEALTHYWORKERPERCENT | Cap % of instances removed at once | 50 | 1–100 | Prevent evicting the whole fleet on a shared dependency |
| Probe interval | How often the platform pings | ~1 min | platform-managed | Not directly tunable |
| Liveness vs readiness | Whether the path checks “alive” or “can serve” | your design | your design | Separate them: never fail liveness on optional deps |
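
The effect of the consecutive-failure threshold is easiest to see on a concrete probe history. The following is a toy model, not platform code: it assumes one probe per interval and that only an unbroken run of failures counts; the function name is hypothetical.

```python
from typing import Iterable, Optional


def first_removal_index(probe_results: Iterable[bool],
                        max_ping_failures: int = 10) -> Optional[int]:
    """Index of the probe at which an instance would be marked unhealthy.

    probe_results: True for a healthy response, False for a failed ping.
    Returns None if the consecutive-failure threshold is never reached.
    """
    if not 2 <= max_ping_failures <= 10:
        raise ValueError("max_ping_failures must be between 2 and 10")
    streak = 0
    for i, ok in enumerate(probe_results):
        streak = 0 if ok else streak + 1  # any success resets the run
        if streak >= max_ping_failures:
            return i
    return None


# A dependency that blips twice, recovers, then blips three times
history = [True, False, False, True, False, False, False]
print(first_removal_index(history, max_ping_failures=2))   # 2
print(first_removal_index(history, max_ping_failures=10))  # None
```

With the threshold at 2 the first short blip already removes the instance; at the default of 10 the same history never does.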

Design rule for the path itself — what to include and what to never include:

| Health path returns 200 when… | Include in the check | Never include |
| --- | --- | --- |
| The process is up and the runtime is healthy | In-process self-checks (config loaded, threadpool ok) | A call to an external payment API |
| A required dependency is reachable (DB the app cannot serve without) | A fast, cached DB ping | A slow aggregate query or report |
| The instance can serve a real request | Cheap synthetic request path | Anything that itself can hang |
| | | Optional/best-effort downstreams (cache, search) |
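
The design rule can be expressed as a small evaluator that a /healthz handler would call. This is a minimal sketch under stated assumptions: the check names, the 503 status for "unhealthy" and the `evaluate_health` function are illustrative choices, and a real implementation would also put a timeout around each check so the probe itself cannot hang.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def _safe(check: Check) -> bool:
    """Treat an exception inside a check as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def evaluate_health(required: Dict[str, Check],
                    optional: Dict[str, Check]) -> Tuple[int, dict]:
    """Return (http_status, body) for a shallow health endpoint.

    Only required checks can fail the probe; optional ones are reported
    in the body but never change the status code.
    """
    body = {"required": {}, "optional": {}}
    healthy = True
    for name, check in required.items():
        ok = _safe(check)
        body["required"][name] = ok
        healthy = healthy and ok
    for name, check in optional.items():
        body["optional"][name] = _safe(check)
    return (200 if healthy else 503), body


# Search is down, but the instance can still serve: the probe stays green
status, body = evaluate_health(
    required={"config_loaded": lambda: True},
    optional={"search": lambda: False},
)
print(status, body)
```

Keeping optional dependencies out of the status code is what stops one shared outage from failing the probe on every instance at once.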

Cold starts and the slow first request

A cold start is latency on the first request to a worker with no warm process. Not an error — but a 30-second first request feels like an outage and can trip upstream timeouts into a 502. First, the four ways a worker ends up cold and what fixes each:

| Cold-start trigger | When it happens | The fix | Tier required |
| --- | --- | --- | --- |
| Idle unload (~20 min no traffic) | Low-traffic apps overnight | Always On = true | B1+ |
| Just deployed | After every deploy/restart | Slot-swap with warm-up | S1+ (5 slots on Standard; Basic has no deployment slots) |
| Scaled out (new instance) | Autoscale adds an instance under load | Pre-warmed instances | P1v3+ |
| Swapped in without warm-up | After a slot swap | WEBSITE_SWAP_WARMUP_PING_PATH | S1+ (slots require Standard or higher) |

Always On — the single most important setting

By default App Service unloads an idle app after about 20 minutes of no requests, and the next request pays full cold start. Always On sends a periodic internal request that keeps a warm worker resident, so users never hit a cold process from idleness.

# Always On requires Basic (B1) or higher — NOT available on Free/Shared
az webapp config set --name app-shop-prod --resource-group rg-shop-prod --always-on true

siteConfig: {
  alwaysOn: true   // requires B1+ ; silently unavailable on F1/D1
}

Confirm it’s actually on (a frequent surprise — it defaults off):

az webapp config show -n app-shop-prod -g rg-shop-prod --query alwaysOn -o tsv
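
To see how much idle unloading costs a low-traffic app, the rule above can be modelled in a few lines. This is a rough sketch, assuming a fixed 20-minute idle timeout and a single worker; the function name and the sample traffic are illustrative.

```python
from typing import Sequence


def count_idle_cold_starts(request_minutes: Sequence[float],
                           idle_timeout_min: float = 20.0,
                           always_on: bool = False) -> int:
    """Count requests that arrive after the worker was unloaded for idleness.

    request_minutes: sorted request arrival times, in minutes.
    The very first request is not counted; only idle gaps are modelled.
    """
    if always_on:
        return 0  # the periodic internal request keeps the worker resident
    cold = 0
    for prev, cur in zip(request_minutes, request_minutes[1:]):
        if cur - prev > idle_timeout_min:
            cold += 1
    return cold


overnight = [0, 5, 30, 45, 120]  # gaps of 5, 25, 15 and 75 minutes
print(count_idle_cold_starts(overnight))                  # 2
print(count_idle_cold_starts(overnight, always_on=True))  # 0
```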

Pre-warmed instances on Premium (and the scale-out cold start)

Even with Always On, scaling out to a new instance exposes a cold worker to the first requests that land on it. Premium v3 (Pv3) plans support pre-warmed instances: with the plan’s automatic scaling enabled, the platform keeps a configured number of buffer instances warm and ready before they take traffic, so scale-out doesn’t expose cold workers.

# Set the number of pre-warmed instances (Premium plans)
az webapp config set --name app-shop-prod --resource-group rg-shop-prod \
  --prewarmed-instance-count 2

ARR affinity — sticky sessions that can sabotage warmth

ARR affinity (ARRAffinity cookie) pins a client to one instance. Useful for legacy stateful apps, harmful for cold starts and scaling: traffic concentrates on a few instances, others stay cold, and a client stuck to a recycled instance pays repeated cold starts. For stateless apps (which yours should be), turn it off so the load balancer spreads load and keeps all instances warm.

az webapp update --name app-shop-prod --resource-group rg-shop-prod \
  --client-affinity-enabled false

resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'app-shop-prod'
  properties: {
    clientAffinityEnabled: false   // disable the ARRAffinity sticky cookie for stateless apps
    serverFarmId: plan.id
  }
}

Deployment-slot warm-up before swap

When you swap a staging slot into production, the slot’s workers must be warm or the first production users hit a cold start. Slot warm-up sends requests to a configured path on the slot before completing the swap, only swapping once it responds healthily:

# Warm up the staging slot before swap completes
az webapp config appsettings set --name app-shop-prod --resource-group rg-shop-prod \
  --slot staging \
  --settings WEBSITE_SWAP_WARMUP_PING_PATH=/healthz WEBSITE_SWAP_WARMUP_PING_STATUSES=200
az webapp deployment slot swap --name app-shop-prod --resource-group rg-shop-prod \
  --slot staging --target-slot production

The swap is the warm-up mechanism: instances warm in staging keep their warmth through the swap, so production never goes cold. This is the reason to deploy via slot-swap rather than in-place.
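
The warm-up contract (ping a path, accept only listed status codes, swap only once it succeeds) can be mimicked in a pre-swap smoke test of your own. The sketch below is an illustration, not the platform's implementation: `warm_up` and `fetch_status` are hypothetical names, the HTTP call is injected so the loop can be exercised without a network, and a real script would add a delay between attempts.

```python
from typing import Callable, Iterable, Set


def warm_up(fetch_status: Callable[[str], int],
            path: str = "/healthz",
            accepted: Iterable[int] = (200,),
            max_attempts: int = 5) -> bool:
    """Ping `path` until a status in `accepted` comes back, or give up.

    fetch_status would issue an HTTP GET against the staging slot
    and return the response status code.
    """
    ok_codes: Set[int] = set(accepted)
    for _ in range(max_attempts):
        if fetch_status(path) in ok_codes:
            return True   # slot is warm; the swap may proceed
    return False          # never warmed; abandon the swap


# Simulated slot that needs two pings before it is ready
responses = iter([503, 503, 200])
print(warm_up(lambda path: next(responses)))  # True
```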

What’s actually slow: JIT, DI, and image pull

Cold-start cost is concrete: .NET/JVM JIT compilation (the runtime compiles IL/bytecode to native on first execution — reduce with .NET ReadyToRun/trimming or JVM tiered compilation); DI container build + first-use init (building the DI graph, priming the first DB connection, resolving config including Key Vault references); and for custom containers an image pull and start (a 2 GB image is a slow docker pull before the first byte). You don’t eliminate the work — you ensure a warm worker has already paid for it before a user arrives. Keep images small, use a same-region registry, and enable Always On so the pull happens at deploy, not on idle wake.

The full menu of cold-start mitigations, ranked by what they cost and how much effort they take:

| Technique | What it does | Cost | Effort | Covers which trigger |
| --- | --- | --- | --- | --- |
| Always On | Keeps a warm worker resident | Free (B1+ already paid) | Trivial (one flag) | Idle unload |
| Disable ARR affinity | Spreads load so all instances stay warm | Free | Trivial | Idle/uneven warmth |
| Slot-swap with warm-up | Production never serves a cold worker post-deploy | Slot cost (S1+) | Low | Deploy |
| Pre-warmed instances | Buffer instances ready before traffic | Premium v3 SKU | Low | Scale-out |
| Smaller container image | Faster image pull on cold instances | Free (build effort) | Medium | Scale-out / deploy |
| Same-region ACR | Cuts pull latency/egress | Negligible | Low | Scale-out / deploy |
| ReadyToRun / AOT / trimming | Less JIT at startup | Free (larger artifact) | Medium | Every cold start |
| Raise upstream timeout | Stops cold start tripping a 502 | Free | Trivial | Masks, doesn’t fix |

Restart loops: when the worker can’t stay up

A restart loop is the platform recycling the worker repeatedly because it dies or never becomes healthy. The app flaps; users see alternating 502/503. Each cause has a distinct fingerprint — match yours:

| # | Restart-loop cause | Fingerprint in the logs | Confirm with | Real fix |
| --- | --- | --- | --- | --- |
| 1 | Failing health-check path | Perpetual unhealthy; every instance evicted | Health check blade; /healthz KQL | Shallow path; raise max-ping-failures |
| 2 | Bad app setting | Identical startup exception every recycle | az webapp log tail; diff app settings | Correct setting; deploy via slot |
| 3 | Key Vault reference failure | No Key Vault error in the app; secret-backed value never resolves | Environment variables blade (red error) | Fix identity/RBAC/firewall/secret/URI |
| 4 | OOM against SKU memory ceiling | Memory ~100% right before each recycle | MemoryWorkingSet Maximum; Memory Analysis | Fix leak or scale up RAM |
| 5 | Crashing container | “Container exited” repeating with exit code | default_docker.log | Fix entrypoint; PID 1 binds 0.0.0.0:$PORT |

Failing health-check path

If the health check path returns non-200, the instance is marked unhealthy, evicted, and replaced — and the replacement fails the same probe, looping forever. A health path that depends on a down dependency turns a dependency outage into a total outage via eviction.

Confirm. Health check blade shows perpetual unhealthy; the /healthz KQL (above) shows steady non-2xx. Fix: make the health path shallow; raise WEBSITE_HEALTHCHECK_MAXPINGFAILURES to ride out transient blips.

A bad app setting

A single malformed app setting — a typo’d connection string, a feature flag the app refuses to start without, a wrong environment name — crashes startup on every boot. Because settings are injected as env vars, a bad one is a boot-time landmine.

Confirm. az webapp log tail shows the same startup exception on every recycle, naming the value it choked on; diff az webapp config appsettings list. Fix: correct the setting and redeploy via slot so it’s caught in staging; treat settings as code (Bicep, reviewed).
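
One way to make a bad setting show up as a single readable log line, rather than a stack trace deep in a driver, is to validate required settings before anything else starts. A minimal sketch, assuming settings arrive as environment variables; the function name and the keys in the usage lines are hypothetical.

```python
import os
from typing import Dict, List, Mapping


def validate_settings(required: List[str],
                      env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise one error naming every bad key."""
    problems = []
    values = {}
    for key in required:
        value = env.get(key, "").strip()
        if not value:
            problems.append(f"{key}: missing or empty")
        else:
            values[key] = value
    if problems:
        # One message listing every problem, so a single recycle reveals them all
        raise RuntimeError("Invalid app settings -> " + "; ".join(problems))
    return values


# At process start, before building the DI graph or opening connections
try:
    validate_settings(["DB_CONN", "ENV_NAME"], env={"DB_CONN": "  "})
except RuntimeError as exc:
    print(exc)
```

The same check run in the staging slot fails there first, which is the point of deploying via slot.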

Key Vault reference failures at boot

App settings can be Key Vault references of the form @Microsoft.KeyVault(SecretUri=https://kv-shop.vault.azure.net/secrets/db-conn/), resolved at startup via the app’s managed identity. If resolution fails (identity not enabled, no access policy / RBAC role, vault firewall blocking, secret deleted/disabled, or wrong URI), the secret never arrives: the app receives the unresolved reference text in place of the real value, and it crash-loops. The app never sees “Key Vault denied me”; it sees a broken connection string.

Confirm. List the settings that are Key Vault references:

# Lists app settings whose value is a Key Vault reference; resolution status itself is shown in the portal
az webapp config appsettings list --name app-shop-prod --resource-group rg-shop-prod \
  --query "[?contains(value, 'KeyVault')]" -o json

In the portal, Environment variables shows each reference with a green tick (Resolved) or a red error and reason. Confirm the identity and its access:

# Is a managed identity enabled?
az webapp identity show --name app-shop-prod --resource-group rg-shop-prod -o json

# Does that identity have get/list on secrets? (RBAC model)
PRINCIPAL=$(az webapp identity show -n app-shop-prod -g rg-shop-prod --query principalId -o tsv)
az role assignment list --assignee "$PRINCIPAL" \
  --scope $(az keyvault show -n kv-shop --query id -o tsv) -o table

Fix. Enable the managed identity and grant it Key Vault Secrets User (RBAC) or a get-secret access policy; ensure the vault firewall allows trusted Azure services / the app’s outbound; verify the secret exists and is enabled; verify the URI (a stale pinned version or a wrong vault name breaks it).

# Assign the identity and capture its principal ID in one step
PRINCIPAL=$(az webapp identity assign --name app-shop-prod --resource-group rg-shop-prod \
  --query principalId -o tsv)
az role assignment create --assignee "$PRINCIPAL" \
  --role "Key Vault Secrets User" \
  --scope $(az keyvault show -n kv-shop --query id -o tsv)

// Grant the app's system-assigned identity read on the vault, then reference a secret
resource kvRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(site.id, kv.id, 'kv-secrets-user')
  scope: kv
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6') // Key Vault Secrets User
    principalId: site.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

Every distinct way a Key Vault reference fails, and the one check that proves each:

| Failure mode | What the app sees | How to confirm | Fix |
| --- | --- | --- | --- |
| No managed identity enabled | Unresolved reference text → crash | az webapp identity show returns null | az webapp identity assign |
| Identity lacks RBAC/access policy | Unresolved reference text → crash | az role assignment list --assignee <principalId> empty | Grant Key Vault Secrets User |
| Vault firewall blocks the app | Resolution times out → unresolved | Vault networking shows “selected networks” only | Allow trusted services / app subnet |
| Secret deleted or disabled | Unresolved reference text → crash | Secret missing/disabled in vault | Restore/enable the secret |
| Wrong SecretUri (typo / stale version) | Unresolved reference text → crash | Environment variables blade red error | Correct the URI (drop pinned version) |
| Soft-delete purge / wrong vault name | 404 on resolve | URI host doesn’t match the vault | Point to the right vault |
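
Two of these failure modes (wrong vault name, pinned version) are visible in the reference text itself and can be caught before deploy. The sketch below is a purely syntactic lint, not a resolution check: it handles only the SecretUri form, assumes the public-cloud host suffix `.vault.azure.net`, and `check_reference` is a hypothetical helper name. It cannot tell whether the identity has access.

```python
import re
from typing import Optional
from urllib.parse import urlparse

_REF = re.compile(r"^@Microsoft\.KeyVault\(SecretUri=(?P<uri>[^)]+)\)$")


def check_reference(value: str, expected_vault: str) -> Optional[str]:
    """Return a problem description for a Key Vault reference, or None if it looks sane."""
    match = _REF.match(value.strip())
    if not match:
        return "not a SecretUri-style Key Vault reference"
    uri = urlparse(match.group("uri"))
    if uri.scheme != "https":
        return "SecretUri must use https"
    # Assumption: public-cloud vault host naming
    if uri.hostname != f"{expected_vault}.vault.azure.net":
        return f"host {uri.hostname!r} does not match vault {expected_vault!r}"
    parts = [p for p in uri.path.split("/") if p]
    if len(parts) < 2 or parts[0] != "secrets":
        return "path must be /secrets/<name>[/<version>]"
    if len(parts) == 3:
        return f"pinned to version {parts[2]!r}; a stale version will not resolve"
    if len(parts) > 3:
        return "path has too many segments"
    return None


ref = "@Microsoft.KeyVault(SecretUri=https://kv-shop.vault.azure.net/secrets/db-conn/)"
print(check_reference(ref, "kv-shop"))      # None
print(check_reference(ref, "kv-shop-dev"))  # host mismatch message
```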

Out-of-memory against the plan’s memory limit

Each SKU has a per-instance memory ceiling (B1 ≈ 1.75 GB, P1v3 ≈ 8 GB). An app that leaks memory or needs more than the SKU offers gets OOM-killed and recycled — repeatedly under load. It looks like a random restart loop but it’s deterministic against the ceiling.

Confirm. Instance memory pinned near 100% right before each recycle:

az monitor metrics list \
  --resource $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --metric MemoryWorkingSet --interval PT1M --aggregation Maximum -o table

Diagnose and solve problems → Memory Analysis correlates recycles with pressure and can capture a dump. Fix: fix the leak (capture a dump via Kudu / the Collect a Memory Dump detector), or scale up to more RAM (P1v3 8 GB, P2v3 16 GB). Scaling out does not help a per-instance OOM: each instance hits the same ceiling.
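
If you have exported the MemoryWorkingSet samples, the "pinned near the ceiling, then recycled" fingerprint can be checked mechanically. This is a heuristic sketch: the 90% "near ceiling" and 50% "collapse" thresholds are assumptions chosen for illustration, and the sample data is invented.

```python
from typing import List, Sequence, Tuple

GIB = 1024 ** 3


def recycles_at_ceiling(samples: Sequence[Tuple[int, int]],
                        ceiling_bytes: int,
                        near: float = 0.9,
                        drop: float = 0.5) -> List[int]:
    """Find timestamps where memory was near the ceiling and then collapsed.

    samples: (timestamp, working_set_bytes) pairs in time order.
    A sharp drop right after a near-ceiling reading is treated as a recycle.
    """
    hits = []
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        if m0 >= near * ceiling_bytes and m1 <= drop * m0:
            hits.append(t1)
    return hits


b1_ceiling = int(1.75 * GIB)  # B1 per-instance memory, as quoted above
samples = [
    (0, int(0.5 * GIB)), (1, int(1.2 * GIB)), (2, int(1.7 * GIB)),
    (3, int(0.3 * GIB)), (4, int(1.0 * GIB)), (5, int(1.72 * GIB)),
    (6, int(0.3 * GIB)),
]
print(recycles_at_ceiling(samples, b1_ceiling))  # [3, 6]
```

Recycles that always follow a near-ceiling reading point at memory; recycles at arbitrary memory levels point elsewhere.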

A crashing container

A custom container that exits (bad entrypoint, missing dependency, fatal startup error, process not PID 1 / not handling signals) gets restarted on a loop — it “starts” then immediately exits.

Confirm. default_docker.log shows the container starting and exiting repeatedly, often with the exit code and entrypoint stderr:

az webapp log tail --name app-api-prod --resource-group rg-shop-prod
# Look for repeating: "Starting container" → process output → "Container exited" → restart

Fix. Reproduce locally (docker run -e WEBSITES_PORT=8000 -p 8000:8000 yourimage); fix the entrypoint; run the main process in the foreground as PID 1 binding 0.0.0.0:$PORT. A container that runs locally but loops on App Service is almost always the port/bind contract (Cause 2) or a missing app setting.
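
For reference, the port/bind contract looks like this in a minimal server. It is a sketch only, assuming the port arrives in a PORT environment variable as in the text above, with 8000 as an arbitrary fallback; the handler and function names are illustrative.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping, Tuple


def resolve_bind(env: Mapping[str, str] = os.environ,
                 default_port: int = 8000) -> Tuple[str, int]:
    """Bind on all interfaces, on the port given by PORT (fallback: default_port)."""
    # 0.0.0.0, not 127.0.0.1: the platform's probe comes from outside the container
    return "0.0.0.0", int(env.get("PORT", default_port))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")


def main() -> None:
    # Runs in the foreground; as the container entrypoint this process is PID 1
    HTTPServer(resolve_bind(), Handler).serve_forever()
```

Calling `main()` from the image's entrypoint keeps the server in the foreground. A process that binds to localhost, or forks into the background and exits, produces exactly the start-then-exit loop described above.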

The diagnostic toolkit: exact paths

Knowing where to look is half the battle. First, the tools matrix — what each shows, how to reach it, and what it’s best for — then the detail on each:

| Tool | What it shows | How to access | Best for |
| --- | --- | --- | --- |
| az webapp log tail | Live stdout/stderr + platform messages | CLI / Cloud Shell | Crashes, container loops, port-probe lines |
| Kudu / SCM console | File system, processes, shell | https://<app>.scm.azurewebsites.net | Stray app_offline.htm, default_docker.log, process truth |
| Diagnose and solve problems | Pre-correlated detectors | App blade → Diagnose and solve | Fast root-cause hypothesis (restarts, SNAT, CPU/mem) |
| App Insights: Failures | Exceptions/failed requests + dependencies grouped | App Insights resource → Failures | The exact failing operation and stack |
| App Insights: Live Metrics | Real-time req/failure rate, CPU, mem, live exceptions | App Insights → Live Metrics | Watching an active incident unfold |
| Health check blade | Per-instance health status | App blade → Health check | Which instances are unhealthy and why |
| Metrics Explorer | SNAT, CPU, memory, 5xx, response time | App / plan → Metrics | Trends, alerts, correlating to recycles |
| default_docker.log | Container start/port/exit story | az webapp log download / Kudu | The authoritative container narrative |
| Activity log | Control-plane operations (restart, scale, config) | Subscription / RG → Activity log | “Who changed what and when” |
| az webapp log download | Zipped filesystem + container logs | CLI | Offline analysis; sharing with support |

az webapp log tail — live application + platform log stream. Your first move for crashes and container loops; streams stdout/stderr and platform messages in real time. Enable filesystem logging first if you see nothing:

az webapp log config --name app-shop-prod --resource-group rg-shop-prod \
  --application-logging filesystem --level information --docker-container-logging filesystem
az webapp log tail --name app-shop-prod --resource-group rg-shop-prod

Kudu / SCM console — the file system and process truth. https://<app>.scm.azurewebsites.net (portal: Advanced Tools → Go). Browse /home/site/wwwroot (find a stray app_offline.htm), read /home/LogFiles (default_docker.log, eventlog.xml), run a shell on the worker, inspect the running process. On Linux, the SSH option (/webssh/host) drops you into the container.

Diagnose and solve problems — the guided detectors. The app blade → Diagnose and solve problems runs Microsoft’s detectors over your telemetry. The category you’ll live in is Availability and Performance (Web App Down, Application Crashes, High CPU/Memory, SNAT Port Exhaustion, Application Restarts). Fastest route to a root-cause hypothesis — the detectors already correlate restarts, metrics and exceptions for you. The detectors you’ll actually use, mapped to the symptom they crack:

| Detector | Category | Cracks which symptom | What it correlates |
| --- | --- | --- | --- |
| Web App Down | Availability & Performance | Total outage / 5xx spike | Availability, restarts, exceptions |
| Application Crashes | Availability & Performance | 502 from worker crash | Crash dumps, exit codes, exceptions |
| Application Restarts | Availability & Performance | 503 from recycle/loop | Restart events + their cause |
| SNAT Port Exhaustion | Availability & Performance | 502/timeouts under load | Allocated vs failed SNAT connections |
| Memory Analysis | Availability & Performance | OOM restart loop | Memory pressure vs recycles; dump capture |
| High CPU Analysis | Availability & Performance | Slow/503 under load | CPU per instance vs throttling |
| TCP Connections | Diagnostics | Outbound dependency failures | Open/failed outbound connections |

Application Insights — Failures + Live Metrics. The richest tool. Failures groups exceptions and failed requests/dependencies by type and shows the exact failing operation and stack. Live Metrics streams request/failure rate, CPU, memory and live exceptions in real time — invaluable during an active incident. Wire it up via the connection string:

az webapp config appsettings set --name app-shop-prod --resource-group rg-shop-prod \
  --settings APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=...;IngestionEndpoint=..."

The KQL you’ll reach for most:

// All failed requests in the last 30 min with status code + the operation
requests
| where timestamp > ago(30m) and success == false
| summarize count() by resultCode, operation_Name, cloud_RoleInstance
| order by count_ desc

The KQL cheat-sheet — one query per question you’ll ask in an incident:

| Question | Table | Key columns | One-liner |
| --- | --- | --- | --- |
| Which requests are failing and where? | requests | resultCode, operation_Name, cloud_RoleInstance | where success == false \| summarize count() by ... |
| What’s actually throwing? | exceptions | problemId, outerMessage, operation_Name | summarize count() by problemId |
| Which dependency is failing under load? | dependencies | target, type, success | where success == false \| summarize count() by target |
| Is one instance worse than the rest? | requests | cloud_RoleInstance | summarize count() by cloud_RoleInstance |
| Are requests slow (cold start / timeout)? | requests | duration, timestamp | summarize percentile(duration,95) by bin(timestamp,1m) |
| Is the health path failing? | requests | url, success | where url endswith "/healthz" \| summarize ... |

Health check configuration. The Health check blade (or healthCheckPath in config) — set a path, watch per-instance health, tune WEBSITE_HEALTHCHECK_MAXPINGFAILURES (2–10). This is both a diagnostic (which instances are unhealthy) and a control (eviction behaviour).

Container logs. For Linux/containers, default_docker.log (via az webapp log download or Kudu → /home/LogFiles) is authoritative for the start/port/exit story. The platform’s pings, the “didn’t respond on port” line, and the container’s own stdout all land here. The log files you’ll open, and what each is the source of truth for:

| Log file / location | Lives in | Source of truth for |
| --- | --- | --- |
| default_docker.log | /home/LogFiles | Container start, port probe, exit code |
| eventlog.xml | /home/LogFiles (Windows) | ANCM 500.3x startup failures |
| <app>_docker.log / stdout logs | /home/LogFiles | Your app’s stdout/stderr |
| LogFiles/Application | /home/LogFiles/Application | Filesystem application logs (when enabled) |
| LogFiles/http/RawLogs | /home/LogFiles/http | Raw HTTP/W3C access logs |
| deployments/ | /home/site/deployments | Last deploy status / failure |

The complete app-settings reference

Half of these incidents are one app setting away from fixed. This is the canonical reference — what each controls, its default, valid values, and when you actually change it. Keep it open while you read az webapp config appsettings list:

| Setting | What it controls | Default | Valid range / values | When to change |
| --- | --- | --- | --- | --- |
| WEBSITES_PORT | Port the platform probes on a custom container | 80 | any TCP port your app binds | Always, for custom containers not on 80 |
| WEBSITES_CONTAINER_START_TIME_LIMIT | Seconds to wait for container start | 230 | 1–1800 | Heavy containers; treat long starts as a smell |
| WEBSITE_HEALTHCHECK_MAXPINGFAILURES | Consecutive failed pings before an instance is removed from rotation | 10 | 2–10 | Lower for fast eviction; higher to ride blips |
| WEBSITE_HEALTHCHECK_MAXUNHEALTHYWORKERPERCENT | Max % of fleet removed at once | 50 | 1–100 | Stop a shared dependency evicting everything |
| WEBSITE_SWAP_WARMUP_PING_PATH | Path pinged on a slot before swap completes | unset | any path | Always set for zero-cold-start swaps |
| WEBSITE_SWAP_WARMUP_PING_STATUSES | Status codes that count as “warm” | any status code | comma list e.g. 200,202 | When only specific codes (e.g. 200) should count as warm |
| WEBSITE_RUN_FROM_PACKAGE | Run from an immutable package mount | 0 | 1, or a package URL | Atomic deploys; prevents partial-state files |
| WEBSITE_DNS_SERVER | Custom DNS for outbound resolution | platform | e.g. 168.63.129.16 | Private DNS / private endpoint resolution |
| WEBSITE_VNET_ROUTE_ALL | Route all outbound through the VNet | 0 | 0 / 1 | Force egress via NAT GW / firewall |
| WEBSITES_ENABLE_APP_SERVICE_STORAGE | Mount persistent /home for containers | true (built-in) / false (custom) | true / false | Containers needing shared persistent storage |
| APPLICATIONINSIGHTS_CONNECTION_STRING | App Insights ingestion target | unset | connection string | Always, in production |
| WEBSITE_TIME_ZONE | Process time zone | UTC | TZ name | App logs/schedules in local time |
| SCM_DO_BUILD_DURING_DEPLOYMENT | Build on the Kudu side during deploy | varies | true / false | Oryx build vs pre-built artifact |
| WEBSITE_LOAD_CERTIFICATES | Load certs into the worker store | unset | thumbprint list / * | Client-cert / mTLS to downstreams |

A short note on precedence, because it bites: an app setting overrides the same key in your app’s config file, and a slot-specific (sticky) setting stays with the slot through a swap. A value that looks wrong in code is often correct in the platform — always diff the live settings, not the repo.
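
The precedence rule and the "diff the live settings" habit can be sketched together. This assumes platform settings reach the process as environment variables and that the repo's config has been flattened to a simple key/value mapping; both function names and the keys are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


def effective_setting(key: str, file_config: Mapping[str, str],
                      env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Resolve a key the way the precedence rule describes: platform value first."""
    if key in env:
        return env[key]
    return file_config.get(key)


def drift(file_config: Mapping[str, str],
          env: Mapping[str, str]) -> Dict[str, Tuple[str, str]]:
    """Keys whose live value differs from the repo's config: {key: (repo, live)}."""
    return {k: (file_config[k], env[k])
            for k in file_config
            if k in env and env[k] != file_config[k]}


repo = {"ConnStr": "Server=dev-db", "Mode": "dev"}
live = {"ConnStr": "Server=prod-db"}
print(effective_setting("ConnStr", repo, live))  # Server=prod-db
print(drift(repo, live))  # {'ConnStr': ('Server=dev-db', 'Server=prod-db')}
```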

App Service plan tiers and what each fixes

The plan SKU is not just “more power” — specific tiers unlock specific fixes. Match the failure to the tier.

| Tier | vCPU / RAM (approx) | What it fixes for these problems | Notable limits |
| --- | --- | --- | --- |
| F1 (Free) | Shared / 1 GB | Nothing production. Demos only. | Daily CPU quota (~60 min), no Always On, no custom-domain TLS, no slots, no scale-out → 503s by design |
| D1 (Shared) | Shared / 1 GB | Slightly more than Free | Daily quotas, no Always On, no scale-out |
| B1–B3 (Basic) | 1–4 vCPU / 1.75–7 GB | Always On, no daily CPU quota, custom domains + TLS. Kills cold-start-from-idle and Free/Shared quota 503s. | No autoscale (manual scale only), no deployment slots, modest RAM (OOM risk on heavy apps) |
| S1–S3 (Standard) | 1–4 vCPU / 1.75–7 GB | Autoscale, up to 5 deployment slots, daily backups. Fixes scale-out 503s and enables slot-swap warm-up. | RAM still modest; no pre-warmed instances |
| P1v3–P3v3 (Premium v3) | 2–8 vCPU / 8–32 GB | More RAM (fixes OOM loops), pre-warmed instances (fixes scale-out cold start), up to 20 slots, better price/perf, VNet integration. The production default. | Higher cost; still finite SNAT without NAT Gateway |
| I (Isolated v2 / ASE) | Dedicated, large | Network isolation in a dedicated App Service Environment, very high scale, private by default. Fixes hard isolation/compliance needs. | Highest cost; operational overhead |

The same tiers, read as a capability grid against the features that fix these incidents:

| Capability | F1 | D1 | B1–B3 | S1–S3 | P1v3–P3v3 | Isolated v2 |
| --- | --- | --- | --- | --- | --- | --- |
| Always On | No | No | Yes | Yes | Yes | Yes |
| Daily CPU quota | Yes (~60m) | Yes | No | No | No | No |
| Manual scale-out | No | No | Yes (≤3) | Yes | Yes | Yes (high) |
| Autoscale | No | No | No | Yes | Yes | Yes |
| Deployment slots | 0 | 0 | 0 | 5 | 20 | 20 |
| Pre-warmed instances | No | No | No | No | Yes | Yes |
| Max RAM per instance | ~1 GB | ~1 GB | ~7 GB | ~7 GB | ~32 GB | large |
| VNet integration | No | No | Yes (regional) | Yes | Yes | Native |
| Custom-domain TLS | No | No | Yes | Yes | Yes | Yes |

And the decision rule as a table — match the symptom to the smallest tier that fixes it:

| If you’re seeing… | It’s gated by… | Smallest tier that fixes it |
| --- | --- | --- |
| Afternoon 503s on a dev app | Free/Shared daily quota, no Always On | B1 |
| Cold start after idle | Always On unavailable | B1 |
| 503 spikes at peak, can’t autoscale | No autoscale on Basic | S1 |
| Cold workers right after scale-out | No pre-warmed instances | P1v3 |
| OOM restart loops on a heavy app | RAM ceiling too low | P1v3 / P2v3 |
| Need private-only inbound + isolation | Shared infrastructure | Isolated v2 (ASE) |
| Outbound SNAT pain under load | Not a tier problem | Add NAT Gateway via VNet integration |

The decision rule in prose: if you’re seeing Free/Shared quota 503s or no Always On, go B1+. If you’re seeing scale-out 503s, go Standard+ for autoscale. If you’re seeing OOM restart loops or scale-out cold starts, go Premium v3 for the RAM and pre-warmed instances. If you need outbound at scale without SNAT pain, the tier matters less than adding a NAT Gateway via VNet integration. Tiers above the bug don’t fix the bug — a P3v3 still crash-loops on a bad Key Vault reference.
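
The rule reduces to "take the highest minimum tier among the capabilities you need". A small sketch of that lookup follows; the capability labels are made up for illustration, and the tier mapping simply restates the decision table above.

```python
TIER_ORDER = ["F1", "D1", "B1", "S1", "P1v3"]

# Smallest tier that unlocks each capability, restating the decision table
MIN_TIER = {
    "always_on": "B1",
    "no_daily_quota": "B1",
    "autoscale": "S1",
    "prewarmed_instances": "P1v3",
    "more_ram": "P1v3",
}


def smallest_tier(needs):
    """Smallest tier that satisfies every capability in `needs`."""
    unknown = set(needs) - set(MIN_TIER)
    if unknown:
        # e.g. SNAT exhaustion or a bad Key Vault reference: no tier fixes these
        raise KeyError(f"not a tier-gated capability: {sorted(unknown)}")
    ranks = [TIER_ORDER.index(MIN_TIER[n]) for n in needs]
    return TIER_ORDER[max(ranks)] if ranks else TIER_ORDER[0]


print(smallest_tier(["always_on"]))                         # B1
print(smallest_tier(["always_on", "autoscale"]))            # S1
print(smallest_tier(["autoscale", "prewarmed_instances"]))  # P1v3
```

Anything that is not in the mapping raises instead of returning a bigger SKU, which mirrors the point that tiers above the bug don't fix the bug.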

Architecture at a glance

The diagram traces the request as it actually flows, then shows the cause and diagnostic move for each of the four symptom classes. Read it left to right: an HTTP request enters App Service and lands on the front end (ARR), which must pick a healthy worker and proxy to it. From there it fans into four columns. The 502 Bad Gateway column traces to a worker process crashed or timed out, pointing you at App Service logs and App Insights Failures. The 503 Unavailable column traces to a plan at its scale limit or still warming up (diagnose via plan CPU and instance count). The slow first request column is a cold start with Always On off (fix by enabling Always On and pre-warming). The restart loop column is bad config or a failing liveness probe (confirm via deployment logs and the health-check path).

Notice the shared footer: every path converges on the same three instruments — az webapp log tail, Diagnose and solve problems, and Application Insights → Failures. That’s the whole method: localise the symptom to a column, read the cause, run the named diagnostic, apply the fix. The first question on every incident is “is the front end getting a bad answer (502) or no answer (503)?” — the column you land in tells you which logs to open first.

Figure: Azure App Service request flow from an HTTP request through the ARR front end to a worker, fanning into four failure columns (502 Bad Gateway, 503 Service Unavailable, slow first request, restart loop) that converge on az webapp log tail, Diagnose and solve problems, and Application Insights Failures.

Real-world scenario

Lumio Retail runs its e-commerce checkout API on Azure App Service: a Linux custom container (.NET 8) on a single S1 Standard plan in Central India, fronted by Application Gateway with WAF. Traffic averages 400 requests/second with a 6pm spike to ~1,800 rps during flash sales. The platform team is four engineers; the monthly App Service spend is about ₹18,000.

The incident began on a Friday flash sale. At 18:03 the WAF dashboard lit up with 502 Bad Gateway — about 12% of checkout calls failing, climbing to 30% by 18:10. The on-call engineer’s reflex: restart the app. It helped for ninety seconds, then 502s resumed. Second reflex: scale up S1 → P1v3. The error rate dropped to ~8% but didn’t clear, and the bill implication spooked the manager. Forty minutes in, revenue impact was real and the incident bridge was full.

The breakthrough came from asking the right first question. App Service’s own request logs showed checkout requests succeeding — completing in 70–110 seconds. The 502 wasn’t App Service’s; it was Application Gateway timing out the backend. az network application-gateway http-settings list showed requestTimeout: 60. Under flash-sale load the checkout call (which fanned out to a payment provider over a fresh HttpClient per request) was taking longer than 60 seconds and the gateway was cutting it. So there were two coupled bugs: a slow backend, and an upstream timeout shorter than the backend’s worst case.

The slow backend itself was SNAT port exhaustion. Diagnose and solve problems → SNAT Port Exhaustion showed SnatConnectionCount with a non-zero Failed dimension climbing from exactly 18:03. The per-request HttpClient to the payment provider had, under 1,800 rps, blown through the ~128 SNAT ports per instance; new outbound connections queued and timed out — which is why each checkout took 70–110 s, which is why the gateway 502’d. The restart “fixed” it momentarily by resetting connection state; scaling up to P1v3 helped a little via fresh SNAT pools.

The fix landed in two parts. That night: raise the App Gateway requestTimeout to 120 and scale out to 3 instances to triple the SNAT pool. The following week: replace the per-request HttpClient with a single IHttpClientFactory client (connection reuse cut outbound connections ~95%), and VNet-integrate with a NAT Gateway for a huge SNAT pool independent of instance count. The next flash sale ran at 1,900 rps with zero SNAT failures, checkout p95 fell from 90 s to 240 ms, and they moved back down to S1 + autoscale (max 4) at ₹16,500 — lower than before. The lesson on the wall: “A 502 behind a gateway is a question, not an answer. Ask whether App Service even saw the failure.”
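
The numbers in this incident can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the figures quoted in the scenario (about 128 SNAT ports per instance, a 60 s then 120 s gateway timeout, about 95% connection reuse); the 0.5 s payment-call duration is an assumption made purely for illustration, and the connection estimate is a rough application of Little's law, not a platform formula.

```python
SNAT_PORTS_PER_INSTANCE = 128  # approximate figure quoted in the scenario


def snat_pool(instances: int) -> int:
    """Pre-allocated outbound SNAT ports across the plan's instances."""
    return SNAT_PORTS_PER_INSTANCE * instances


def gateway_cuts_request(backend_seconds: float, gateway_timeout_seconds: float) -> bool:
    """True when the upstream gateway gives up before the backend answers."""
    return backend_seconds > gateway_timeout_seconds


def concurrent_outbound(rps: float, seconds_per_call: float,
                        reuse_fraction: float = 0.0) -> float:
    """Rough count of simultaneous new outbound connections (Little's law).

    reuse_fraction is the share of calls served by an already-open connection.
    """
    return rps * seconds_per_call * (1.0 - reuse_fraction)


print(snat_pool(1), snat_pool(3))      # 128 384
print(gateway_cuts_request(70, 60))    # True: the 502 seen at the gateway
print(gateway_cuts_request(110, 120))  # False after raising the timeout
print(concurrent_outbound(1800, 0.5))  # 900.0 with a client per request
print(round(concurrent_outbound(1800, 0.5, reuse_fraction=0.95)))  # 45
```

Under these assumptions a client per request needs far more ports than even three instances provide, while connection reuse brings demand well under a single instance's allocation, which is why the code change mattered more than the scaling.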

The incident as a timeline, because the order of moves is the lesson:

| Time | Symptom | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| 18:03 | 502 at 12%, climbing | (alert fires) | | Ask: did App Service see the failure? |
| 18:05 | 502 at 18% | Restart the app | +90 s relief, then recurs | Don’t restart blind |
| 18:12 | 502 at 30% | Scale up S1 → P1v3 | 30% → 8%, cost spike | Don’t scale up to mask |
| 18:40 | Still 8% | Read App Service request logs | Requests succeeding in 70–110 s | This was the breakthrough |
| 18:48 | Root cause found | Check App GW requestTimeout (60 s) + SNAT detector | Two coupled bugs identified | |
| 19:05 | Mitigated | Timeout → 120, scale out to 3 | 502s clear | Correct night-of fix |
| +1 week | Fixed | IHttpClientFactory + NAT Gateway; back to S1 | 0 SNAT fails, p95 240 ms, ₹16,500 | The actual fix is code |

Advantages and disadvantages

The managed-workers-behind-a-shared-front-end model both causes this class of problem and makes it diagnosable. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Platform captures crash, restart and port-probe details automatically (Kudu, default_docker.log, detectors), so you rarely lack data | The HTTP status you see (502/503) is the front end’s complaint, abstracting away the real cause; you must dig |
| Diagnose and solve problems detectors pre-correlate restarts, metrics and exceptions: root-cause hypothesis in one click | The abstraction hides the worker; you can’t ssh to “the server” the way you would a VM, so you work through Kudu/SCM |
| Always On, pre-warmed instances and slot warm-up are built-in fixes for cold starts, with no custom tooling | Defaults are unsafe for prod: Always On off, ARR affinity on, health check unset. You must turn knobs |
| Scaling out/up is a slider; autoscale handles 503-from-load automatically | Scaling masks resource bugs (SNAT, OOM): it “works” temporarily and hides the real fix, costing money |
| Shared front end + health check evict bad instances automatically, improving availability | A bad health-check path evicts all instances and turns a dependency blip into a total outage |
| SNAT, CPU and memory are first-class metrics you can alert on | Finite SNAT (~128/instance) is invisible until you hit it under load: passes in test, fails in prod |
| Key Vault references keep secrets out of config | A failed Key Vault reference crash-loops the app with no obvious “denied” error, so it looks like a random restart |

The model is right for standard web apps and APIs where you want to ship code, not operate servers, and built-in cold-start and scaling controls suffice. It bites hardest on chatty outbound workloads (SNAT), memory-heavy apps on small SKUs (OOM), and anyone who deploys with defaults and never tunes Always On / ARR affinity / health check. The disadvantages are all manageable — but only if you know they exist, which is the point of this article.

Hands-on lab

Reproduce a 502 from the WEBSITES_PORT bug, watch it in the logs, and fix it, all at minimal cost (we use B1, which is not free, so delete it at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

```shell
RG=rg-appsvc-lab
LOC=centralindia
PLAN=plan-lab
APP=app-lab-$RANDOM   # globally-unique hostname
az group create -n $RG -l $LOC -o table
```

Step 2 — Create a B1 plan (Linux) so Always On is available.

```shell
az appservice plan create -n $PLAN -g $RG --is-linux --sku B1 -o table
```

Expected: a plan row, sku.name = B1, kind = linux.

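Step 3 — Deploy a container that listens on 8080, but DON’T set WEBSITES_PORT (reproduce the bug). The lab only reproduces the bug if the image really listens on 8080, so check its Dockerfile (EXPOSE) first; an image that listens on 80 answers the platform’s default probe and loads without any fix. The command below uses a public sample image: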

```shell
az webapp create -n $APP -g $RG -p $PLAN \
  --deployment-container-image-name mcr.microsoft.com/azuredocs/aci-helloworld:latest -o table
```

Step 4 — Hit the site and watch it fail. Browse to https://$APP.azurewebsites.net — you get a 502/“Application Error”. Confirm via the logs:

```shell
az webapp log config -n $APP -g $RG --docker-container-logging filesystem
az webapp log tail -n $APP -g $RG
# Watch for: "didn't respond to HTTP pings on port: 80, failing site start"
```

The platform probes port 80; the container (on 8080) never answers → the dead-giveaway line.
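
Spotting that line by eye works in a lab; against a long production log a filter is quicker. A minimal sketch, assuming you have the container log as text on stdin. The function name probed_port is illustrative, and only the quoted phrase in the sample line comes from the platform message above (the surrounding prefix is made up):

```shell
# probed_port: read container log text on stdin and print the port the
# platform gave up probing (first match only). Prints nothing if no match.
probed_port() {
  sed -n "s/.*didn't respond to HTTP pings on port: \([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Illustrative sample line; the prefix before "didn't" is hypothetical.
sample="ERROR - Container app-lab_0 didn't respond to HTTP pings on port: 80, failing site start"
printf '%s\n' "$sample" | probed_port
```

If the port it prints differs from the port your process binds, the port contract is the problem and WEBSITES_PORT is the fix. Feed it a saved copy of the log (for example probed_port < saved.log) rather than a live stream, since head exits after the first match.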

Step 5 — Fix it by declaring the real port.

```shell
az webapp config appsettings set -n $APP -g $RG --settings WEBSITES_PORT=8080
az webapp restart -n $APP -g $RG
```

Wait ~30–60 s, reload the URL. Expected: the “Welcome to Azure Container Instances!” page renders — 502 gone.

Step 6 — Turn on the production-safe defaults and a health check.

```shell
az webapp config set -n $APP -g $RG --always-on true --generic-configurations '{"healthCheckPath": "/"}'
az webapp update -n $APP -g $RG --client-affinity-enabled false
az webapp config show -n $APP -g $RG --query "{alwaysOn:alwaysOn, health:healthCheckPath}" -o table
```

Expected: alwaysOn: true, health: /.

Validation checklist. You reproduced a 502 purely from the port contract, identified it from default_docker.log’s port-probe line, fixed it with WEBSITES_PORT, and applied Always On + disabled ARR affinity + set a health check. No code involved — exactly the point. The lab steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 3 | Deploy container, no WEBSITES_PORT | The port contract is real and unforgiving | First container deploy of any team |
| 4 | Watch default_docker.log | The confirming log line exists and is specific | The 90-second diagnosis |
| 5 | Set WEBSITES_PORT=8080 | The one setting that fixes it | The actual production fix |
| 6 | Always On + no ARR + health check | Defaults are unsafe; you must tune them | Hardening every new app |

Cleanup (avoid lingering plan charges).

```shell
az group delete -n $RG --yes --no-wait
```

Cost note. A B1 plan is a few rupees per hour; an hour of this lab is well under ₹50, and deleting the resource group stops everything. (B1 has no free tier, but it’s the cheapest SKU with Always On.)

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read at 02:14, then the same entries with the full confirm-command detail underneath.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | Intermittent 502 under load, fine at rest, dependency calls timing out | SNAT port exhaustion (no connection reuse) | Diagnose and solve → SNAT Port Exhaustion; az monitor metrics list --metric SnatConnectionCount --aggregation Total (Failed > 0) | IHttpClientFactory + pooled DB; VNet + NAT Gateway; Private Endpoints |
| 2 | New container deploy 502s immediately; logs say app started fine | WEBSITES_PORT unset/wrong; or bound 127.0.0.1 | az webapp log download → default_docker.log: “didn’t respond to HTTP pings on port: 80” | --settings WEBSITES_PORT=<port>; bind 0.0.0.0:$PORT |
| 3 | Heavy container 502/503 on cold instances, fine once warm | Startup > WEBSITES_CONTAINER_START_TIME_LIMIT (230 s) | default_docker.log gap “Starting” → “failing site start” > limit | Raise to ≤1800; shrink image; same-region ACR; Always On |
| 4 | 502 to client, App Service logs show request succeeding (slowly) | Upstream timeout at Front Door / App Gateway | App Insights duration vs az network application-gateway http-settings list --query "[].requestTimeout" (60 s) | Speed up backend; raise upstream timeout; fast probe path |
| 5 | Fine, then 502/503 after a deploy; redeploy doesn’t help | Stray app_offline.htm in wwwroot | Kudu → Debug console → ls /home/site/wwwroot/app_offline.htm | Delete file; WEBSITE_RUN_FROM_PACKAGE=1 |
| 6 | Dev/test app 503s every afternoon, recovers ~midnight | Free (F1)/Shared (D1) daily CPU quota | az appservice plan show --query "{tier:sku.tier}" = Free/Shared; quota detector | az appservice plan update --sku B1 |
| 7 | Whole app 503s when a downstream DB blips | Health path depends on the downstream → all instances evicted | Health check blade all-unhealthy; /healthz KQL non-2xx across instances | Shallow/honest path; raise WEBSITE_HEALTHCHECK_MAXPINGFAILURES |
| 8 | Restart loop after a config change; same exception every boot | Bad app setting injected as env var | az webapp log tail repeats the trace; diff az webapp config appsettings list | Correct setting; deploy via slot-swap; settings as Bicep |
| 9 | Restart loop, no app exception; secret-backed settings empty | Key Vault reference failed at boot | Environment variables blade red error; az webapp config appsettings list --query "[?contains(value,'KeyVault')]"; az webapp identity show | Enable identity; grant Key Vault Secrets User; fix firewall/secret/URI |
| 10 | Restart loop under load; instances die; memory ~100% | OOM vs SKU memory ceiling (B1 ≈ 1.75 GB) | az monitor metrics list --metric MemoryWorkingSet --aggregation Maximum; Memory Analysis | Fix leak (dump) or scale up (P1v3 8 GB / P2v3 16 GB) |
| 11 | Custom container “starts” then exits, over and over | Container crashes on startup (entrypoint, PID 1, signals, bind) | az webapp log tail: “Starting container” → stderr → “Container exited” repeating | docker run locally; fix entrypoint; PID 1 binds 0.0.0.0:$PORT |
| 12 | Slow first request after a quiet period or right after a swap | Cold start (Always On off, or slot not warm) | az webapp config show --query alwaysOn = false; App Insights slow first request after gaps | --always-on true; disable ARR; pre-warmed instances; swap warm-up |
| 13 | After enabling autoscale, still 503 spikes at peak | max-count too low / cooldown slow / SKU ceiling | az monitor autoscale show; plan CPU pinned while instance count flat | Raise --max-count; lower cooldown; scale up; pre-warm |
| 14 | 403 to some callers after locking the app down | Access restriction / private access blocking the caller | Access restrictions blade; az webapp config access-restriction show | Add caller IP/range; fix SCM rules / private routing |

The expanded form, with the full reasoning for the entries that bite hardest:

1. Intermittent 502 under load, fine at rest, dependency calls timing out. Root cause: SNAT port exhaustion from non-reused outbound connections (new HttpClient/socket per request). Confirm: Diagnose and solve problems → SNAT Port Exhaustion; or az monitor metrics list --metric SnatConnectionCount --aggregation Total shows a non-zero Failed dimension under load. Fix: Reuse connections (IHttpClientFactory, pooled DB clients); VNet-integrate with a NAT Gateway; use Private Endpoints for Azure PaaS targets. Scaling out is a temporary band-aid (+128 ports/instance).

2. Brand-new container deploy returns 502 immediately; container logs say the app started fine. Root cause: Container listens on a port the platform isn’t probing: WEBSITES_PORT unset/wrong (default probe 80), or the app bound 127.0.0.1 instead of 0.0.0.0. Confirm: az webapp log download → default_docker.log shows “didn’t respond to HTTP pings on port: 80, failing site start”. Fix: az webapp config appsettings set --settings WEBSITES_PORT=<real-port>; ensure the app binds 0.0.0.0:$PORT.

3. Heavy container 502/503s on cold instances, fine once warm. Root cause: Startup exceeds WEBSITES_CONTAINER_START_TIME_LIMIT (default 230 s) — slow image pull + init. Confirm: default_docker.log timestamps between “Starting container” and “failing site start” exceed the limit. Fix: Raise the limit (max 1800 s): --settings WEBSITES_CONTAINER_START_TIME_LIMIT=600; shrink the image; same-region ACR; enable Always On.

4. 502 from the client but App Service logs show the request succeeding (slowly). Root cause: Upstream timeout at Front Door / Application Gateway: the backend is slower than the front end’s timeout. Confirm: App Insights shows request duration near/over the gateway timeout; read the configured value with az network application-gateway http-settings list --query "[].requestTimeout" (60 s in the incident above; read it rather than assuming a default) or check the Front Door origin response timeout. Fix: Speed up the backend (the real fix); raise the upstream timeout to match a legitimately long op; point the health probe at a fast path.

5. App was fine, then started 502/503-ing after a deploy; redeploying doesn’t help. Root cause: A stray app_offline.htm left in wwwroot by an interrupted/partial .NET deployment. Confirm: Kudu/SCM → Debug console → ls /home/site/wwwroot/app_offline.htm. Fix: Delete the file; switch to run-from-package (WEBSITE_RUN_FROM_PACKAGE=1) for atomic deploys.

6. Dev/test app serves 503 every afternoon, recovers around midnight. Root cause: Free (F1) / Shared (D1) daily CPU quota exhausted; the app is stopped until the 00:00 UTC reset. Confirm: az appservice plan show --query "{tier:sku.tier}" returns Free/Shared; Diagnose and solve reports quota exceeded. Fix: Move to B1+ (az appservice plan update --sku B1) — no daily CPU quota and Always On.

7. The whole app goes 503 at once whenever a downstream DB has a blip. Root cause: Health check path depends on the downstream, so when it’s down every instance fails the probe and is evicted — total outage from a partial one. Confirm: Health check blade shows all instances unhealthy; KQL on /healthz shows non-2xx across all cloud_RoleInstance. Fix: Make the health path shallow/honest (liveness ≠ readiness); raise WEBSITE_HEALTHCHECK_MAXPINGFAILURES (2–10) to ride out blips; don’t hard-fail on optional dependencies.

8. App restart-loops right after a config change; the same exception every boot. Root cause: A bad app setting (typo’d connection string, missing required value) crashing startup, injected as an env var on every recycle. Confirm: az webapp log tail shows the identical startup stack trace each loop, naming the value; diff az webapp config appsettings list. Fix: Correct the setting; deploy via slot-swap so bad settings are caught in staging; manage settings as Bicep.

9. Restart loop with no app exception logged; secrets-backed settings look empty. Root cause: Key Vault reference failure at boot — managed identity missing, no RBAC/access policy on the vault, vault firewall blocking, or secret deleted/disabled — so the reference resolves to nothing. Confirm: Portal → Environment variables shows the reference with a red error; az webapp config appsettings list --query "[?contains(value,'KeyVault')]"; check az webapp identity show and az role assignment list --assignee <principalId>. Fix: Enable identity; grant Key Vault Secrets User; allow trusted services on the vault firewall; verify the secret exists/enabled and the SecretUri is correct.

10. Restart loop under load; instances die and come back; memory pinned near 100%. Root cause: OOM against the SKU’s per-instance memory ceiling (B1 ≈ 1.75 GB) — a leak, or the app simply needs more RAM. Confirm: az monitor metrics list --metric MemoryWorkingSet --aggregation Maximum; Diagnose and solve → Memory Analysis correlates recycles with pressure. Fix: Capture and analyse a memory dump (Kudu / Collect a Memory Dump detector) to fix the leak; or scale up to P1v3 (8 GB)/P2v3 (16 GB). Scaling out doesn’t help per-instance OOM.

11. Custom container “starts” then immediately exits, over and over. Root cause: Container crashes on startup — bad entrypoint, missing dependency, process not PID 1 / not handling signals, or it can’t bind the port. Confirm: az webapp log tail shows “Starting container” → stderr → “Container exited” repeating with an exit code. Fix: docker run locally with WEBSITES_PORT set to reproduce; fix the entrypoint; run the main process in the foreground as PID 1 binding 0.0.0.0:$PORT.

12. Slow first request after every quiet period (or right after a slot swap); subsequent ones fast. Root cause: Cold start — the idle app was unloaded (Always On off), or the swapped-in slot wasn’t warm, so a worker pays runtime boot + JIT + (for containers) image pull on the next request. Confirm: az webapp config show --query alwaysOn returns false; App Insights shows a slow first request after gaps or a spike immediately post-swap that tapers. Fix: az webapp config set --always-on true (B1+); disable ARR affinity; for scale-out cold starts use Premium pre-warmed instances; deploy via slot-swap with WEBSITE_SWAP_WARMUP_PING_PATH/WEBSITE_SWAP_WARMUP_PING_STATUSES.

13. After enabling autoscale you still get 503 spikes at peak. Root cause: Autoscale max-count too low, scale-out cooldown too slow to react, or you hit the SKU instance ceiling. Confirm: az monitor autoscale show --query "{min:...,max:...}"; plan CPU pinned high while instance count is flat. Fix: Raise --max-count; lower the rule’s cooldown; scale up the SKU; pre-warm so new instances aren’t cold when they arrive.

14. After locking the app to a front end, some callers get 403. Root cause: An access restriction (IP rule / private access / SCM lockdown) is blocking a legitimate caller — often the SCM site or a health-probe source. Confirm: Access restrictions blade; az webapp config access-restriction show -n app-shop-prod -g rg-shop-prod. Fix: Add the caller’s IP/range (or the gateway/front-door service tag); ensure SCM rules don’t block your deploy pipeline; verify private-endpoint DNS resolves.
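
Entry 7’s advice to raise WEBSITE_HEALTHCHECK_MAXPINGFAILURES is easier to reason about with a small simulation. A sketch under stated assumptions: it treats the setting as a count of consecutive failed probes, uses a made-up sequence of /healthz results, and the function name evicted_after is illustrative:

```shell
# evicted_after <failures tolerated before eviction> <status code per probe...>
# Prints the 1-based probe number at which the instance would be evicted,
# or "never" if the failure streak stays below the threshold.
evicted_after() {
  local max=$1; shift
  local streak=0 n=0 code
  for code in "$@"; do
    n=$((n + 1))
    if [ "$code" -ge 200 ] && [ "$code" -le 299 ]; then
      streak=0                      # any 2xx resets the streak
    else
      streak=$((streak + 1))
    fi
    if [ "$streak" -ge "$max" ]; then echo "$n"; return 0; fi
  done
  echo "never"
}

# Hypothetical three-probe dependency blip, then recovery.
blip="200 503 503 503 200 200"
evicted_after 2 $blip   # unquoted on purpose: one argument per status code
evicted_after 5 $blip
```

With a threshold of 2 the instance is evicted on the third probe; with 5 the same blip passes without eviction. The threshold only buys time, though: a health path that does not hard-fail on the downstream is still the primary fix.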

Best practices

The alerts worth wiring before the next incident — the leading indicators, not the lagging “site down”:

| Alert on | Signal | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| SNAT failures | SnatConnectionCount (Failed) | > 0 sustained 5 min | First sign of outbound exhaustion before 502s spike |
| Memory pressure | MemoryWorkingSet % | > 85% for 10 min | Predicts OOM restart loop |
| CPU saturation | CpuPercentage (plan) | > 80% for 10 min | Predicts 503-from-load before users feel it |
| 5xx rate | Http5xx | > 1% of requests | The symptom: alert, but treat as confirmation |
| Health status | Unhealthy instance count | ≥ 1 for 5 min | Catches eviction before the fleet drops |
| Response time | HttpResponseTime p95 | > your SLO | Cold start / slow backend creeping toward timeout |
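
One row of that table, wired up as a metric alert, might look like the sketch below. Assumptions: the resource names and the alert name are placeholders, the plan’s resource ID has to be resolved first (shown in a comment), and the run wrapper is an illustrative dry-run helper that prints the command unless APPLY=1 is set. No action group is attached here, so add --action with your own group before relying on it for paging:

```shell
# Placeholder names: substitute your own.
RG=rg-shop-prod
PLAN=plan-shop-prod

# For a real run, resolve the plan's resource ID first, for example:
#   PLAN_ID=$(az appservice plan show -n "$PLAN" -g "$RG" --query id -o tsv)
PLAN_ID=${PLAN_ID:-"<plan-resource-id>"}

# Print the command by default; set APPLY=1 to actually execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "DRY-RUN: $*"; fi
}

# CPU saturation row: plan CpuPercentage above 80% across a 10-minute window.
create_cpu_alert() {
  run az monitor metrics alert create \
    -n "plan-cpu-over-80" -g "$RG" --scopes "$PLAN_ID" \
    --condition "avg CpuPercentage > 80" \
    --window-size 10m --evaluation-frequency 5m \
    --description "Plan CPU above 80% for 10 min"
}

create_cpu_alert
```

The other rows follow the same shape with a different metric name, scope and threshold; keep each as its own alert so a firing rule names the leading indicator directly.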

Security notes

The security knobs that also prevent these incidents — secure and resilient pull in the same direction here:

| Control | Setting / mechanism | Secures against | Also prevents |
|---|---|---|---|
| Managed identity + KV references | identity + @Microsoft.KeyVault(...) | Secrets in plaintext config | Hand-rolled secret rotation breaking the app |
| SCM access restrictions | scmIpSecurityRestrictions | Public access to the admin console | Unauthorised deploys causing restart loops |
| HTTPS-only + min TLS | httpsOnly, minTlsVersion | Downgrade / cleartext | “Temporary” TLS-off mistakes |
| Access restrictions (inbound) | ipSecurityRestrictions / private endpoint | Direct internet hits bypassing the WAF | Probe-source confusion causing 403s (if scoped right) |
| Vault firewall + trusted services | Key Vault networking | Secret exfiltration | KV-reference boot failures (when allow-listed correctly) |
| Image digest pinning + scanning | ACR + digest in linuxFxVersion | Tampered/unknown images | Surprise breaking changes from a moved tag |

Cost & sizing

A rough monthly picture for a small production API: 2× B1 (~₹9,000) or 2× S1 with autoscale to 4 (~₹12,000–16,000 at peak), plus App Insights (~₹1,000–3,000). Lumio landed at ₹16,500 after fixing the bug and right-sizing back down, which shows the fix is usually code, not a bigger SKU. The bill drivers, what each one buys you, and how they interact with the fixes:

| Cost driver | What you pay for | Rough INR / month | What it fixes | Watch-out |
|---|---|---|---|---|
| 1× B1 (continuous) | One Basic instance, Always On | ~₹4,000–5,000 | Cold-start-from-idle, Free-quota 503 | No autoscale, modest RAM |
| 2× B1 (HA pair) | Two instances, no single-point | ~₹9,000 | Restart 503s (one always up) | Still no autoscale |
| 2× S1 + autoscale to 4 | Standard + slots + autoscale | ~₹12,000–16,000 at peak | Scale-out 503s, slot warm-up | Pay for peak instances |
| 1× P1v3 | Premium, 8 GB, pre-warmed | ~₹14,000–18,000 | OOM loops, scale-out cold start | Higher floor cost |
| App Insights ingestion | Per-GB telemetry | ~₹1,000–3,000 | (diagnosis itself) | Sample high-traffic apps |
| NAT Gateway | Hourly + per-GB egress | ~₹1,500–3,000 | SNAT exhaustion at scale | Needs VNet integration |

Interview & exam questions

1. A user reports 502 Bad Gateway but App Service logs show the request succeeding. What’s happening and how do you confirm? The 502 is coming from an upstream (Front Door / Application Gateway) timing out the backend, not from App Service: the worker responded, just slower than the front end’s timeout. Confirm by comparing App Insights request duration against the gateway’s configured requestTimeout (App Gateway; 60 s in the incident above) or origin response timeout (Front Door). Fix the slow backend and/or raise the upstream timeout.

2. A freshly deployed Linux container returns 502 even though its logs show the app started. Most likely cause? The container listens on a port the platform isn’t probing — WEBSITES_PORT is unset or wrong (default probe is 80), or the app bound 127.0.0.1 instead of 0.0.0.0. Confirm via default_docker.log (“didn’t respond to HTTP pings on port: 80”). Set WEBSITES_PORT to the real port and bind all interfaces.

3. What is SNAT port exhaustion and why does it pass in test but fail in production? Each instance has a finite pool (~128 pre-allocated) of SNAT ports for outbound connections. Under low test load you never exhaust it; under production load an app that opens a new connection per request (no HttpClient reuse) burns through the pool, and new outbound calls fail — surfacing as intermittent 502s and dependency timeouts. Confirm via the SNAT Port Exhaustion detector / SnatConnectionCount Failed metric. Fix with connection reuse and a NAT Gateway.

4. Difference between a 502 and a 503 from App Service, conceptually? Both are reported by the front end. 502 Bad Gateway = the front end reached a worker but got a broken/no/timed-out response (crash, wrong port, upstream timeout, SNAT failure). 503 Service Unavailable = the front end had no healthy worker to send to (restart in progress, plan over-commit, quota exceeded, all instances evicted by health check). “Bad answer” vs “no answer.”

5. How do you eliminate cold starts on App Service? Enable Always On (keeps a warm worker resident, defeating idle-unload; requires B1+), disable ARR affinity for stateless apps so load spreads and all instances stay warm, use pre-warmed instances on Premium v3 to cover scale-out, and deploy via slot-swap with warm-up so production never serves a cold worker. The underlying cost (JIT, DI build, image pull) isn’t removed — you ensure a warm worker already paid it.

6. An app restart-loops with no application exception in the logs and its secret-backed settings appear empty. What do you check? A Key Vault reference failure at boot: the managed identity is missing or lacks RBAC/access-policy on the vault, the vault firewall is blocking, or the secret is deleted/disabled/mis-URI’d, so the reference resolves to nothing and the app crashes on the empty value. Check the Environment variables blade for a red reference error, az webapp identity show, and az role assignment list. Fix the identity/role/firewall/secret.

7. Why can a health check take your entire app offline, and how do you prevent it? App Service evicts instances that fail the health probe. If the health path depends on a downstream that goes down, every instance fails simultaneously and all are evicted → 503 everywhere (a dependency blip becomes a total outage). Prevent it by keeping the health path shallow (separate liveness from readiness), not hard-failing on optional dependencies, and tuning WEBSITE_HEALTHCHECK_MAXPINGFAILURES (2–10) to ride out transient failures.

8. A dev app serves 503 every afternoon and recovers overnight. Cause and fix? It’s on a Free (F1) or Shared (D1) plan and is exhausting the daily CPU-minute quota, after which App Service stops it until the 00:00 UTC reset. Confirm the tier with az appservice plan show. Fix by moving to B1+, which has no daily CPU quota (and gains Always On).

9. What does WEBSITES_CONTAINER_START_TIME_LIMIT do and when do you change it? It’s the number of seconds the platform waits for a container to start responding before failing the site start — default 230 s, max 1800 s. Raise it for legitimately heavy containers (large image pull + slow init), but treat a long start as a smell: shrink the image, use a same-region registry, and enable Always On so the slow start happens at deploy, not on idle wake.

10. The app OOM-restart-loops under load. Does scaling out fix it? What does? No — scaling out adds instances that each hit the same per-instance memory ceiling, so they all OOM. The fix is to scale up to a SKU with more RAM (e.g. P1v3 8 GB, P2v3 16 GB) and/or fix the leak (capture a memory dump via Kudu / the Memory Analysis detector). Memory is per-instance, so only more RAM-per-instance or less memory-use helps.

11. Which App Service tier unlocks autoscale, and which adds pre-warmed instances? Standard (S1+) unlocks autoscale and up to 5 slots. Premium v3 (P1v3+) adds pre-warmed instances (covering scale-out cold start), much more RAM, and up to 20 slots. Basic (B1) has Always On and no daily quota but only manual scale.

12. You see 502s right after every slot swap that clear within a minute. Why? The staging slot wasn’t warm at swap time, so production briefly served cold workers paying first-request cost. Fix with slot warm-up: set WEBSITE_SWAP_WARMUP_PING_PATH and WEBSITE_SWAP_WARMUP_PING_STATUSES so the swap completes only after the slot responds healthily — instances stay warm through the swap.

These map to AZ-204 (Developer Associate): monitor, troubleshoot and optimize Azure solutions, Application Insights, App Service configuration; and to AZ-104 (Administrator): configure and manage App Service plans, scaling, deployment slots, and monitoring. The networking-cost angle (SNAT, NAT Gateway, VNet integration) touches AZ-700. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam objective area |
|---|---|---|
| 502 vs 503, ANCM codes | AZ-204 | Troubleshoot solutions; App Service config |
| App Insights Failures / KQL | AZ-204 | Instrument & monitor; troubleshoot |
| Plan tiers, autoscale, slots | AZ-104 | Configure & manage App Service plans |
| Health check, restarts, HA | AZ-104 | Monitor & maintain Azure resources |
| SNAT, NAT Gateway, VNet integration | AZ-700 | Design & implement network connectivity |
| Key Vault references, managed identity | AZ-204 / AZ-500 | Secure app config; manage identities |

Quick check

  1. A user gets 502 but App Service’s own request log shows the request completed in 80 seconds. Where is the 502 actually coming from, and what’s the one setting you check first?
  2. Your custom container’s logs say it started successfully, yet the site returns 502. What is the single most likely cause and the exact log line that confirms it?
  3. True or false: scaling out to more instances is the correct fix for an app that keeps getting OOM-killed.
  4. An app restart-loops with no exception logged and its connection string (a Key Vault reference) appears empty. Name two things to check.
  5. Your dev app on F1 dies with 503 every afternoon and comes back by morning. Why, and what’s the fix?

Answers

  1. The 502 is from the upstream Front Door / Application Gateway timing out the backend: the worker responded, just slower than the front end’s timeout. First setting to check: the gateway’s configured requestTimeout (App Gateway; 60 s in the incident above) or the Front Door origin response timeout; compare it to the request’s actual duration.
  2. WEBSITES_PORT is unset or wrong (the platform probes port 80 by default; your container listens elsewhere, or bound 127.0.0.1 instead of 0.0.0.0). The confirming line in default_docker.log is “didn’t respond to HTTP pings on port: 80, failing site start.” Fix by setting WEBSITES_PORT to the real port.
  3. False. Memory is a per-instance ceiling; every scaled-out instance hits the same limit and OOMs. The fix is to scale up to a higher-RAM SKU (and/or fix the leak), not out.
  4. Check (a) that a managed identity is enabled on the app (az webapp identity show) and (b) that the identity has get/list on the vault’s secrets (az role assignment list for Key Vault Secrets User, or an access policy). Also verify the vault firewall isn’t blocking and the secret/URI is valid.
  5. F1 (Free) has a daily CPU-minute quota; once exhausted App Service stops the app until the 00:00 UTC reset. Fix by moving to B1 or higher, which has no daily CPU quota (and gains Always On).

Glossary

Next steps

You can now localise any App Service 5xx to a hop and fix it. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
