Azure Troubleshooting

Troubleshooting Azure SQL Database: Connectivity, Timeouts, Throttling & Blocking

At 02:14 on a release night the pager fires: “checkout API returning 500s, can’t reach the database.” The logs show three exceptions interleaved — A network-related or instance-specific error occurred, Login failed for user, and The service has encountered an error processing your request (40613). The on-call developer is convinced the database is down. It is not — it is healthy, serving four other apps fine. What changed is one firewall rule, one connection-pool setting, and a peered VNet that quietly started routing egress through a new NAT gateway. Three layers, three errors, one incident — and the only way to resolve it fast is to know exactly which layer each symptom belongs to.

Azure SQL Database is a Platform-as-a-Service relational engine: Microsoft runs the SQL Server engine, OS, patching, HA replicas and backups; you own the schema, queries, indexes, connection settings, and the network path that reaches it. That managed boundary is a blessing for uptime and a curse for troubleshooting, because failures you once debugged by RDP-ing into a box now surface as error numbers over the wire. There is no service to restart, no errorlog on a C: drive to tail, no host to ping. You diagnose from error codes, Dynamic Management Views (DMVs), Query Store, Azure Monitor metrics, and the connection policy of the logical server.

This article leads with diagnosis, not theory. We walk the connection path the way a packet does — DNS → firewall → connection policy/gateway → login → engine → query — naming at every hop the real error number, the exact az/T-SQL/KQL command to confirm it, and the fix. Then we go deep on the three failure families that page senior engineers most: transient connectivity and the redirect-vs-proxy port trap, resource throttling (DTU/vCore limits, log-rate governance, worker exhaustion, tempdb pressure), and blocking, locking and deadlocks. Because this is a reference you keep open mid-incident, the playbook, the error codes, the limits and the DMVs are all laid out as scannable tables — read the prose once, then keep the tables open at 02:14. By the end you can take any Azure SQL incident and, within minutes, point at the failing layer and prove it.

To frame the whole field before the deep dive, here is every symptom class this article covers, the hop it belongs to, the question it forces, and the one place to look first:

Symptom class Connection hop it lives at First question to ask First place to look Most common single cause
Can’t connect at all from a VNet DNS / firewall Does the name resolve, and is the egress IP allowed? nslookup + az sql server firewall-rule list Private DNS not linked, or 40615 firewall
Connects then queries hang/drop Connection policy Can the client reach 11000–11999, or only 1433? az sql server conn-policy show + nc -vz Redirect policy + blocked high-port range
Login failed despite right network Login / auth Right database, right identity model? sys.database_principals in the target DB 18456: contained user vs master confusion
Intermittent timeouts under load Engine (governance) Pass at rest, fail under load? sys.dm_db_resource_stats (last hour) 10928/10929 throttling (workers/log rate)
Whole app halts, CPU is fine Engine (locks) One transaction blocking a chain? Blocking-chain DMV query / sp_who2 A long writer holding LCK_M_X
Pool drains, “can’t get a connection” Client driver “From the pool” in the message? App pool config + upstream blocking Leaked connections or blocking-amplified pool

What problem this solves

Azure SQL failures are miscategorised more than any other Azure service, and miscategorisation turns a 10-minute fix into a 3-hour outage. One surface symptom — “the app can’t talk to the database” — can be a DNS misconfiguration, a missing firewall rule, a connection-policy mismatch blocking a port range, an exhausted client pool, a login that lost permission, a database at its DTU ceiling, or one runaway query holding an exclusive lock 400 sessions queue behind. Each has a different owner, fix and blast radius. Treat a throttling event (10928) like a connectivity bug and you waste an hour on firewalls while an unindexed query saturating the log melts your tail latency.

The pain shows up in three shapes. First, “works on my machine, fails in production”: the app connects fine from a laptop or Cloud Shell but times out from inside a peered VNet — almost always the redirect vs proxy connection policy and the 11000–11999 port range that on-prem firewalls and hub-and-spoke topologies silently drop. Second, intermittent timeouts under load that no single query explains: the database hits resource governance limits (DTU/vCore, log rate, workers, tempdb) and sheds work with 10928/10929/40613, but retry logic masks it until the retries pile up. Third, the whole app halts while CPU looks fine: one transaction is blocking a chain of others, sessions stack behind LCK_M_* waits, the pool drains, and even healthy queries can’t get a connection.

Who hits this? Everyone running Azure SQL at scale — especially teams migrating from on-prem SQL Server (carrying forward 1433-only firewall assumptions), teams adopting Private Link (who break DNS in subtle ways), and teams on Basic/Standard DTU or small vCore tiers (who hit governance limits early). The three failure families differ in owner, blast radius and the first move — naming them apart up front is the whole discipline:

Failure family Typical owner Blast radius Pass at rest? Fixed by restart? Right first move
Connectivity (DNS/firewall/policy) Network + DBA One app / one subnet Often (from elsewhere) No (no host to restart) Walk hops 1–3 left to right
Auth (18456/40532/233) DBA + app The mis-credentialed app only No (fails identically) No Check identity exists in the right DB
Throttling (10928/10929/log rate) DBA + app Every session on the DB/pool Usually (load-driven) No (re-throttles instantly) sys.dm_db_resource_stats peaks
Blocking / deadlock App + DBA Cascades across all sessions Sometimes (load-driven) Kill releases it, doesn’t fix it Blocking-chain DMV → root blocker
Client pool / retry App developer The one app process No No (re-exhausts) Look for “from the pool” + dispose

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with Azure SQL basics — that a logical server (*.database.windows.net) hosts one or more databases, the DTU vs vCore models, and connecting with sqlcmd/SSMS/Azure Data Studio. You should know what a VNet, subnet, NSG and route table are, and run az in Cloud Shell. We define every error code and DMV; we don’t re-teach T-SQL.

Where this fits: an Advanced Troubleshooting article in the Azure data track. It assumes the networking primers (Azure Private Endpoint vs Service Endpoint, Azure Virtual Network, Subnets & NSGs) and turns them into a diagnostic discipline. The same connection-path model applies, with minor differences, to Azure SQL Managed Instance and Azure Database for PostgreSQL/MySQL. It is the data-tier sibling of Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops — the same “localise to the failing hop” method, applied to the database instead of the web tier.

A quick map of who confirms what during an incident, so you call the right person fast and stop the wrong team from chasing a red herring:

Layer What lives here Who usually owns it Failure classes it can cause Tool to confirm
Client driver / pool Connection string, pool, retry, timeouts App / dev team “From the pool” timeouts, no retry → flaky App config; the exception text
DNS resolution Public CNAME vs privatelink zone Network + platform Private Link “resolves to public IP” nslookup; az network private-dns
Firewall (server/DB/VNet) IP allow-lists, VNet rules, public access DBA + network 40615, 40914, flat refusal az sql server firewall-rule list
Gateway / connection policy Redirect vs Proxy, 1433 vs 11000–11999 Network + DBA 10060/10054 connect-then-timeout az sql server conn-policy show; nc
Login / auth SQL logins, Entra principals, TLS DBA + identity 18456, 40532, 233 sys.database_principals; az sql ... ad-admin
Engine governance DTU/vCore, log rate, workers, tempdb DBA + app 10928/10929/40544/40551/40553/40613 sys.dm_db_resource_stats
Locks / concurrency Transactions, isolation, indexes App + DBA Blocking, deadlock (1205) sys.dm_exec_requests, sys.dm_tran_locks

Core concepts

Before the playbook, fix five mental models. Every diagnostic maps back to one of them.

The connection is a path, not a wire — the first failing hop owns the incident. A request to a row traverses, in order: client driver → DNS resolution of <server>.database.windows.net → TCP to the gateway → firewall → connection-policy decision (redirect or proxy) → possibly a second TCP hop to the node on 11000–11999 → TLS handshake → login → engine → query → locks and waits. Each hop fails its own way with its own error. Identify the leftmost failing hop; everything to its right is a red herring. A 40615 (firewall) means you never reached login — chasing a “login failed” theory is wasted effort.

Logical server vs database — two firewalls, two scopes. Azure SQL has a logical server (the *.database.windows.net endpoint) hosting one or more databases. Firewall rules exist at both levels: server-level rules (stored in master, apply to every database) and database-level rules (stored in the user database, checked first). A request is allowed if it matches a database-level or a server-level rule. That two-tier model is why an IP can work for one database and be rejected for another on the same server.

Control plane and data plane fail independently. Creating a server, opening a firewall rule, scaling the tier and configuring Private Link are control-plane operations — done with az sql/ARM/Bicep/portal, governed by Azure RBAC. Logging in and running queries are data-plane — governed by SQL logins/Entra principals and database permissions. A correct firewall rule with a wrong password still fails. The plane tells you where to look: az/portal for control-plane, T-SQL/DMVs for data-plane.

Throttling is the platform protecting itself, not a bug. Azure SQL governs you to your purchased limits — CPU, memory, log generation rate, worker threads, concurrent sessions, tempdb — and rejects or queues work over them with specific codes (10928/10929/40544/40551/40613). A throttling incident names exactly which resource you ran out of. The fix is never “restart”; it is “reduce demand, fix the query, or buy more limit.”

Transient vs persistent — why retry logic is mandatory. Azure SQL performs reconfigurations (failovers, node moves for patching, load balancing). During one (usually under ~60 seconds) your connection is dropped and you get a transient error (40613, 40197, 10928, 10053, 40143, 233) — expected and recoverable; reconnect and retry with exponential backoff. A persistent error (40615 firewall, 18456 auth) fails every retry identically; retrying just burns time. Distinguishing the two and retrying only the transient set is the key client-side competence.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept One-line definition Where it lives Why it matters to the incident
Logical server The *.database.windows.net endpoint + security boundary Subscription / resource group Server firewall + Entra admin live here
Database One database hosted on a logical server On the server Database-level firewall + contained users
Gateway Regional front-end terminating your first TCP hit Microsoft-managed Always hit on 1433; emits the redirect decision
Connection policy Redirect / Proxy / Default routing after handshake Server setting Decides if you need 11000–11999
DTU Blended CPU/IO/memory unit (one composite number) Service objective One metric hides which resource throttled
vCore A dedicated logical CPU with separate metrics Service objective CPU/Data IO/Log IO/Workers split out
Resource governance Capping you to the tier’s limits Engine Sheds work with 10928/10929/4055x
Log rate governor Cap on transaction-log MB/s per tier Engine The most-missed throttle (LOG_RATE_GOVERNOR)
Worker thread Engine thread running one active request Engine Hidden ceiling → 10928 Resource ID 1
DMV System view exposing live engine state In each database sys.dm_exec_requests et al.
Wait type What a request is currently waiting on DMV column LCK_M_X, PAGEIOLATCH_*, THREADPOOL
Blocking One session waiting on a lock another holds Engine Persists until released; won’t auto-resolve
Deadlock Mutual blocking; engine kills a victim (1205) Engine Auto-resolves but fails a transaction
Reconfiguration Platform failover/node-move for HA/patching Platform Briefly drops connections → transient errors
Private Link Private IP in your VNet for the server VNet + DNS zone Resolves via privatelink.database.windows.net

The error-number reference

Before the per-hop anatomy, here is the lookup table you scan first: every error number you realistically see from Azure SQL, what it means, the likely cause, how to confirm it, and the first fix. The class column tells you whether to retry (transient) or stop and fix config (persistent / control-plane). Bookmark this — at 02:14 the error number is the diagnosis.

Code Class Meaning (message gist) Likely cause How to confirm First fix
40615 Persistent Cannot open server '<x>' ... IP '<ip>' is not allowed Egress IP in no firewall rule (NAT IP for VNet traffic) IP is in the error; az sql server firewall-rule list Add egress IP, or VNet rule / Private Link
40914 Persistent Cannot open server ... not allowed by VNet rule Subnet not on the VNet-rule allow list / missing service endpoint az sql server vnet-rule list Add the subnet via VNet rule + Microsoft.Sql endpoint
18456 Persistent Login failed for user '<u>' (state narrows it) Wrong password; user not in this DB; wrong tenant token; on master sys.database_principals in target DB Connect to right DB; create/map the principal
40532 Persistent Cannot open server "<x>" requested by the login Server name in conn string didn’t parse; wrong user@server format Check the FQDN and User ID format Fix server name / connection-string format
233 Often transient error during the pre-login handshake TLS mismatch (client 1.0/1.1 vs server 1.2+); Encrypt off az sql server show --query minimalTlsVersion Encrypt=True; upgrade driver; enable TLS 1.2
10060 Transient Connection Timeout Expired / TCP timeout Redirect port range blocked; NSG/firewall on the path nc -vz srv 1433 ok but 11000 times out Force Proxy, or open 11000–11999 to Sql tag
40613 Transient Database '<db>' ... is not currently available Reconfiguration/failover (seconds) or overloaded & recycling Clears in <1 min → transient; persists → check governance Retry w/ backoff; if persistent, scale/fix load
10928 Transient Resource ID: %d ... limit ... has been reached Worker threads (ID 1) or sessions (ID 2) exhausted max_worker_percent / max_session_percent near 100% Break blocking; fix pool; scale up
10929 Transient ... minimum guarantee is %d ... server is too busy Server / elastic pool saturated; min guarantee unmet Pool metrics; pool-mate noisy neighbour Scale pool; isolate noisy DB
40544 Persistent The database has reached its size quota. Data size hit the tier max SUM(size)*8/1024 vs tier max Free space, or raise maxSizeBytes/scale
40549 Transient Session is terminated ... long-running transaction A transaction held resources too long Find the long-running tran in sys.dm_exec_requests Shorten/batch the transaction
40551 Transient Session terminated ... excessive TEMPDB usage tempdb blown (sorts/spills/temp tables) Wait RESOURCE_SEMAPHORE; query plan spills Reduce sort/hash; add index; paginate
40553 Transient Session terminated ... excessive memory usage Oversized memory grant (huge sort/hash) RESOURCE_SEMAPHORE waits; grant size in plan Cut grant: index, smaller batch, MAX_GRANT_PERCENT
49918 Control-plane Cannot process request. Not enough resources Too many operations in progress for the subscription The failing call is az sql ... create/update Serialise + back off management calls
49919 Control-plane ... too many create or update operations ... Subscription create/update op cap hit ARM loop over many DBs Batch/throttle the deployment loop
49920 Control-plane ... too many requests for the subscription Subscription request-rate cap Automation hammering the control plane Add delays/retries to the pipeline
1205 Transient ... chosen as the deadlock victim Two+ transactions locked in opposite order Deadlock graph in system_health ring buffer Consistent lock order; retry 1205
40143 Transient ... encountered a connection error Reconfiguration / transport blip Clears on reconnect Retry with backoff
40197 Transient The service has encountered an error (reconfiguration) Platform reconfiguration in progress Brief metric gap; recovers Retry with backoff
10053 / 10054 Transient transport-level error / connection reset Connection dropped (reconfig, idle reset, redirect fail) Correlate with reconfig or the redirect trap Retry; check connection policy

Three reading notes that save the most time:

Distinction The trap How to tell them apart
Transient vs persistent Retrying a persistent error burns minutes and looks like “flaky DB” Transient clears within ~60 s on reconnect (40613/40197/10928); persistent fails identically every attempt (40615/18456/40544)
Data-plane vs control-plane throttle 10928 and 49918 both say “limit reached” but live in different worlds 10928/10929 come from a query; 49918/49919/49920 come from an az/ARM management call
Network timeout vs slow query Both surface as “timeout” in the app A connection timeout (15 s) = hops 1–4; a command timeout (30 s) = the query ran too long

The connection path, hop by hop

The spine of the article. Walk the path left to right; for each hop you get the failure signature, the confirm command, and the fix. This summary maps each hop to its error and tool before we go deep:

# Hop What it does Signature error(s) Confirm with Fix lever
1 DNS Resolve *.database.windows.net timeout (resolves to public IP) nslookup; az network private-dns ... Link the privatelink zone; forward to 168.63.129.16
2 Firewall Allow the egress IP / subnet 40615, 40914 az sql server firewall-rule list Add IP, VNet rule, or Private Link
3 Connection policy Redirect vs Proxy after handshake 10060, 10054 az sql server conn-policy show; nc Force Proxy or open 11000–11999
4 Login / auth Authenticate the principal 18456, 40532, 233 sys.database_principals; ad-admin list Right DB + identity; TLS 1.2
5 Engine governance Cap to tier limits 10928/10929/4055x/40613 sys.dm_db_resource_stats Reduce demand; scale tier
6 Query / locks Run the query under concurrency blocking; 1205 sys.dm_exec_requests, sys.dm_tran_locks Kill blocker; index; RCSI

Hop 1 — DNS: resolving *.database.windows.net

Your driver first resolves myserver.database.windows.net. By default this is a public CNAME chain ending at a regional gateway’s public IP. With Private Link it must instead resolve to a private IP in your VNet via the privatelink.database.windows.net private DNS zone. The number-one Private Link failure is DNS resolving to the public IP because the private DNS zone is missing, not linked to the right VNet, or overridden by a custom DNS server that doesn’t forward to Azure DNS (168.63.129.16).

Symptom. Timeouts or A network-related or instance-specific error only from inside the VNet meant to use Private Link; fine from the public internet. Or the reverse: the public path is intentionally disabled and a new subnet not linked to the DNS zone fails.

Resolution outcomes and what each tells you — read your nslookup result against this:

What nslookup returns What it means Healthy for which setup? Next step
CNAME → privatelink...private A (10.x/172.16-31.x/192.168.x) Private DNS zone is working Private Link Good; move to firewall/policy
Public gateway IP, from inside the VNet Zone missing/not linked, or custom DNS not forwarding Public access only Link the zone; fix forwarder
Public gateway IP, from the internet Normal public resolution Public access Good; move to firewall
NXDOMAIN / no record Wrong FQDN, or A record never registered Neither Check endpoint state + record-set
Private A but you intended public Stale private zone overriding Public access Unlink the zone or fix split-horizon

Confirm. Resolve from the client, then check the private DNS zone and its VNet links from the control plane:

# From a VM inside the VNet — what does the name resolve to?
nslookup myserver.database.windows.net
# Healthy Private Link: a CNAME to myserver.privatelink.database.windows.net then a
# PRIVATE A record (10.x / 172.16-31.x / 192.168.x). Broken: a PUBLIC gateway IP.

# Is the zone linked to THIS VNet, and does the endpoint have an A record?
az network private-dns zone show -g rg-data -n privatelink.database.windows.net -o table
az network private-dns link vnet list -g rg-data -z privatelink.database.windows.net -o table
az network private-dns record-set a list -g rg-data -z privatelink.database.windows.net -o table

Fix. Create/link the privatelink.database.windows.net zone to the VNet, ensure the endpoint registered its A record, and forward custom DNS to 168.63.129.16 (Azure DNS). In Bicep, the private endpoint + zone group:

resource pe 'Microsoft.Network/privateEndpoints@2023-11-01' = {
  name: 'pe-sql'
  location: location
  properties: {
    subnet: { id: subnetId }
    privateLinkServiceConnections: [ {
      name: 'sql'
      properties: {
        privateLinkServiceId: sqlServerId
        groupIds: [ 'sqlServer' ]   // the correct groupId for Azure SQL DB
      }
    } ]
  }
}

resource zoneGroup 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2023-11-01' = {
  parent: pe
  name: 'default'
  properties: {
    privateDnsZoneConfigs: [ {
      name: 'sql'
      properties: { privateDnsZoneId: privateDnsZoneId }   // privatelink.database.windows.net
    } ]
  }
}

The private DNS zone and group IDs are exact and unforgiving — the right values for Azure SQL DB versus its neighbours:

Resource Private DNS zone groupId (sub-resource) Note
Azure SQL Database (logical server) privatelink.database.windows.net sqlServer This article’s case
Azure SQL Managed Instance privatelink.<region>.database.windows.net (MI domain) (MI endpoint) Different domain; MI is VNet-native already
Azure Synapse (SQL) privatelink.sql.azuresynapse.net Sql Synapse dedicated SQL
Azure Database for PostgreSQL (flex) privatelink.postgres.database.azure.com postgresqlServer Same hop model, different zone
Storage (blob, for bcp/BACPAC) privatelink.blob.core.windows.net blob Relevant for import/export paths

Subtle but lethal: with Private Link, set the connection policy to Proxy — Redirect would hand your client a node port (11000–11999) and node address the private-endpoint path doesn’t expose cleanly. More in Hop 3.

Hop 2 — Firewall: server-level, database-level, VNet rules, public access

If DNS resolved correctly, the next gate is the firewall. Azure SQL rejects an IP matching no rule with error 40615: Cannot open server '<name>' requested by the login. Client with IP address '<ip>' is not allowed to access the server. The IP in the message is the public egress IP the gateway saw — for a VNet behind NAT/Azure Firewall that’s the NAT public IP, not the VM’s private IP, a constant source of confusion.

Symptom. 40615 (server firewall) on connect. Less commonly 40914 when a VNet rule / service-endpoint check fails (request from a subnet not on the allow list), or a flat connection refusal when Public network access is Disabled and you’re outside the private path.

Azure SQL has four distinct ways to admit traffic, evaluated in a definite order. Knowing which one applies stops you adding the wrong kind of rule:

Mechanism Scope Stored in Checked Best for Gotcha
Database-level firewall rule One database The user DB (sys.database_firewall_rules) First Per-DB isolation on a shared server Easy to forget it exists; overrides need not match server rules
Server-level firewall rule Every DB on the server master (sys.firewall_rules) After DB rules A shared allow-list NAT IPs rotate → constant churn
VNet rule (service endpoint) A subnet Server config Per request In-VNet apps without chasing IPs Needs Microsoft.Sql endpoint on the subnet → 40914 if missing
Private endpoint A private IP in your VNet VNet N/A (private path) Strong isolation; disable public access Breaks if DNS resolves public (Hop 1)
“Allow Azure services” (0.0.0.0) Any Azure IP master (special rule) With server rules Quick demos Far too broad for production — avoid

Confirm. List both firewall layers and the public-access setting:

# Server-level firewall rules (live in master)
az sql server firewall-rule list -g rg-data -s myserver -o table

# Is public access disabled entirely? (then only Private Link works)
az sql server show -g rg-data -n myserver --query "publicNetworkAccess" -o tsv

# VNet rules (service-endpoint based)
az sql server vnet-rule list -g rg-data -s myserver -o table

For the database-level rules (checked before server-level), query sys.database_firewall_rules from the user database and sys.firewall_rules from master. The error text gives the offending IP; curl -s ifconfig.me from the client shows its egress IP.

Fix. Add the egress IP/range as a server-level rule, or (preferred for VNet traffic) use a VNet rule / service endpoint to allow the subnet rather than chase changing NAT IPs, or move to Private Link and disable public access. The “Allow Azure services and resources to access this server” toggle is the special 0.0.0.0 rule — convenient but broad; avoid it in production.

# Add a server firewall rule for a specific egress IP range
az sql server firewall-rule create -g rg-data -s myserver \
  -n office-egress --start-ip-address 203.0.113.10 --end-ip-address 203.0.113.10

# Better for VNet traffic: allow the subnet via a VNet rule (needs Microsoft.Sql service endpoint on the subnet)
az network vnet subnet update -g rg-net --vnet-name vnet-app -n snet-app \
  --service-endpoints Microsoft.Sql
az sql server vnet-rule create -g rg-data -s myserver -n allow-app-subnet \
  --vnet-name vnet-app --subnet snet-app
// Server firewall rule + a VNet rule, as code
resource fw 'Microsoft.Sql/servers/firewallRules@2023-08-01-preview' = {
  parent: sqlServer
  name: 'office-egress'
  properties: { startIpAddress: '203.0.113.10', endIpAddress: '203.0.113.10' }
}

resource vnetRule 'Microsoft.Sql/servers/virtualNetworkRules@2023-08-01-preview' = {
  parent: sqlServer
  name: 'allow-app-subnet'
  properties: { virtualNetworkSubnetId: subnetId, ignoreMissingVnetServiceEndpoint: false }
}

Pick the right admission model by where the client lives — this decision table ends the “which rule do I add?” debate:

If the client is… And you want… Use Why
A fixed office / on-prem range Simplicity Server firewall rule One stable IP range
An app in a peered/spoke VNet No IP churn VNet rule (service endpoint) Allow the subnet, not the rotating NAT IP
Anything, with strong isolation Private-only Private endpoint + disable public access Traffic never touches the public gateway
A short-lived demo Speed over safety (temporarily) server rule for your IP Remove it after; never 0.0.0.0 in prod
Many Azure services, low sensitivity Convenience “Allow Azure services” Accept the broad blast radius (rarely worth it)

Hop 3 — Connection policy: Redirect vs Proxy, and the 11000–11999 trap

The single most misunderstood failure in Azure SQL networking, and the cause of the classic “works from on-prem/Cloud Shell, times out from a peered VNet.” You always hit the gateway on TCP 1433 first. What happens next depends on the connection policy:

Policy First hop Subsequent traffic Ports the client must reach Latency / throughput When it applies
Proxy Gateway on 1433 All traffic stays proxied through the gateway on 1433 1433 only Slightly higher latency, lower throughput (extra hop) Default from outside Azure; required when only 1433 is open; mandatory for Private Link
Redirect Gateway on 1433 for the handshake After handshake the client is redirected straight to the database node 1433 and 11000–11999 Lower latency, higher throughput (direct) Default from inside Azure; the perf choice when you control the ports
Default Picks Redirect for in-Azure clients, Proxy for outside-Azure clients depends on where the client is The out-of-the-box setting

The trap: a client inside a VNet defaults to Redirect. After the 1433 handshake, the gateway says “now talk to the node on 11000–11999.” If that subnet only permits 1433 (an NSG, hub firewall, forced-tunnel UDR, or on-prem firewall on the return path blocks the high ports), the handshake succeeds but the redirect fails. The symptom is maddening: connections sometimes work but queries hang or drop with 10060/10054 timeouts, and it “only happens from the peered VNet,” never from Cloud Shell (which sits outside and falls back to Proxy).

Symptom. Intermittent connect-then-timeout from inside a VNet (especially hub-and-spoke with a central firewall or forced tunneling); fine from outside Azure. Connection Timeout Expired after a successful-looking connect, or error: 10060.

The exact ports each scenario must allow outbound — print this for the network team:

Source → SQL Policy in effect Must allow outbound Common blocker
Cloud Shell / laptop (outside Azure) Proxy (default) TCP 1433 Corporate firewall blocking 1433
In-VNet app, default policy Redirect (default) TCP 1433 + 11000–11999 NSG / hub firewall dropping 11000–11999
In-VNet app, forced to Proxy Proxy TCP 1433 None (that’s the point)
On-prem over ExpressRoute/VPN Set to Proxy TCP 1433 On-prem firewall on return path for high ports
Private Link (any source) Proxy (set it) TCP 1433 to the private IP DNS resolving public (Hop 1)

Confirm. Check the connection policy and test the port range:

# What connection policy is set on the server?
az sql server conn-policy show -g rg-data -s myserver --query "connectionType" -o tsv
# Returns: Default | Proxy | Redirect
# From the VNet client, can you actually reach the redirect port range?
# Test 1433 (should work) then a sample high port.
nc -vz myserver.database.windows.net 1433
nc -vz myserver.database.windows.net 11000
# If 1433 connects but 11000 times out, Redirect cannot complete from here.

Fix. Either open 11000–11999 outbound from the subnet to the Sql service tag (the lowest-latency answer for in-VNet workloads), or force the connection policy to Proxy so everything funnels over 1433 (simpler, slightly higher latency/lower throughput, the right call for Private Link and where you can’t open the high-port range).

# Option A: force Proxy so only 1433 is needed
az sql server conn-policy update -g rg-data -s myserver --connection-type Proxy

# Option B (preferred for performance): allow the redirect port range to the Sql service tag
az network nsg rule create -g rg-net --nsg-name nsg-app -n allow-sql-redirect \
  --priority 200 --direction Outbound --access Allow --protocol Tcp \
  --destination-address-prefixes Sql --destination-port-ranges 1433 11000-11999
// Force Proxy connection policy as code
resource connPolicy 'Microsoft.Sql/servers/connectionPolicies@2023-08-01-preview' = {
  parent: sqlServer
  name: 'default'
  properties: { connectionType: 'Proxy' }
}

Proxy vs Redirect is a real trade-off, not just a workaround — choose deliberately:

Dimension Proxy Redirect
Ports needed 1433 only 1433 + 11000–11999
Latency Higher (gateway relays every packet) Lower (direct to node)
Throughput Lower under heavy load Higher
Works through restrictive firewalls Yes Only if high ports are opened
Private Link compatibility Recommended Awkward (node address/ports)
Failover transparency Gateway abstracts node moves Client may need to re-resolve node
Best for On-prem, Private Link, locked-down subnets In-VNet, latency-sensitive, controllable ports

Rule of thumb: Private Link → Proxy. In-VNet, performance-sensitive, ports controllable → Redirect + open 11000–11999. On-prem over ExpressRoute/VPN where you can only guarantee 1433 → Proxy. Forcing Proxy is also the fastest way to prove the high-port range is the problem: if Proxy makes the timeouts vanish, the redirect ports were being blocked.

Hop 4 — Login & authentication: 18456, 40532, 233

The engine now authenticates you. The canonical failure is 18456 — Login failed for user ‘<user>’ (generic; the state suffix narrows it). Common causes: wrong password; the user doesn’t exist in that database (contained-user vs server-login confusion); the Entra ID token is for the wrong tenant/audience; or you connected to master when the user only exists in the user DB. A cousin, 40532 — Cannot open server “<name>” requested by the login, means the server name in the connection string didn’t match (a database parsed into the server name, or the wrong User ID=user@server format some drivers require).

Symptom. 18456 (auth), 40532 (server-name parse / login routing), or 233/10060 (handshake closed before auth — often a TLS mismatch, below).

The 18456 state number is the real diagnosis — it tells you why the login failed when the message deliberately won’t. The states you actually meet:

18456 State Meaning Most likely cause Fix
2 / 5 Invalid user ID Login/user doesn’t exist Create the login/user; check spelling
6 Attempt with a Windows login over SQL auth Wrong auth mode Use the correct auth method
7 Login disabled, and password wrong Disabled account Re-enable; reset password
8 Password mismatch Wrong password Fix the secret (check Key Vault)
38 / 40 Cannot open the database requested DB doesn’t exist / no access / wrong DB Connect to the right DB; grant access
102–111 Entra/AAD auth failures Token audience/tenant/expiry issue Re-acquire token for database.windows.net
(generic, no state surfaced) Client masks it Connected to master not the user DB Set Initial Catalog/-d to the user DB

Confirm. Establish which identity model you’re using and whether the principal exists where you expect:

-- In the USER database: does the contained user / Entra principal exist?
SELECT name, type_desc, authentication_type_desc
FROM sys.database_principals
WHERE type IN ('S','E','X','U');   -- SQL user, Entra user, Entra group, Windows

-- In master: server-level logins (for non-contained logins)
SELECT name, type_desc FROM sys.sql_logins;

For token auth, verify the server’s Entra admin with az sql server ad-admin list -g rg-data -s myserver -o table.

Fix. Match the auth model to where the principal lives. For SQL auth, create a login in master and a user in the database, or a contained database user directly in the user DB. For Entra auth, ensure the server has an Entra admin, the database maps the user, and the client requests an https://database.windows.net/ audience token. Concrete fixes:

-- Contained user for an app, scoped to one database (preferred for least privilege)
CREATE USER [app_orders] WITH PASSWORD = '<strong-secret>';
ALTER ROLE db_datareader ADD MEMBER [app_orders];
ALTER ROLE db_datawriter ADD MEMBER [app_orders];

-- Map a managed identity / Entra app as a database user (no password)
CREATE USER [mi-checkout-api] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [mi-checkout-api];

The identity models differ in where the principal lives and how it breaks — pick the row your app uses:

Auth model Where the principal lives Connection string shape Fails as Best for
SQL login + DB user Login in master, user in the DB User ID=app;Password=... 18456 if user missing in DB Legacy / cross-DB logins
Contained DB user (SQL) Only in the user DB User ID=app;Password=...;Database=db 18456 if you connect to master Per-DB isolation, portability
Entra user / group Entra + mapped DB user Authentication=Active Directory Password/Interactive 18456 state 102–111 (token) Human admins, MFA
Managed identity Entra + FROM EXTERNAL PROVIDER user Authentication=Active Directory Managed Identity 18456 if mapping missing App-to-DB, no passwords
Service principal Entra app + DB user Authentication=Active Directory Service Principal Token audience/tenant errors CI/CD, automation

The 233 / error occurred during the pre-login handshake error is usually not auth — it’s TLS. Azure SQL enforces TLS 1.2+; a client pinned to TLS 1.0/1.1 or an old driver fails here. Set Encrypt=True, update the driver, and check az sql server show -g rg-data -n myserver --query minimalTlsVersion.

Resource throttling: when the engine sheds load

You’re authenticated, but queries start failing under load. This is resource governance: Azure SQL caps you to your tier and rejects/queues work over it. The codes are unambiguous once you know them.

The throttling error codes

Code Meaning The resource you exhausted Class First lever
10928 Resource ID: %d. The %s limit for the database is %d and has been reached. (Resource ID 1 = worker threads, 2 = sessions) Worker threads or concurrent sessions Data-plane Break blocking; fix pool; scale
10929 The %s minimum guarantee is %d, maximum limit is %d ... However, the server is currently too busy Throttled because the server/elastic pool is saturated; your min guarantee couldn’t be met Data-plane Isolate noisy pool-mate; scale pool
40544 The database has reached its size quota. Storage (data size limit for the tier) Persistent Free space; raise maxSizeBytes
40549 Session terminated because of a long-running transaction A transaction held resources too long Data-plane Shorten/batch transactions
40551 Session terminated due to excessive tempdb usage tempdb Data-plane Reduce sort/hash/temp; add index
40553 Session terminated due to excessive memory usage (e.g. a huge sort/hash) Memory grant Data-plane Cap grant; index; smaller batch
40613 Database '<db>' on server '<server>' is not currently available. Transient — reconfiguration/failover or the database is overloaded and being recycled Transient Retry; if persistent, scale/fix
49918 / 49919 / 49920 Cannot process request: too many operations in progress / subscription has too many DTU/create operations / too many requests for the subscription Control-plane operation throttling (scaling, create) Control-plane Serialise + back off az/ARM

The 49918/49919/49920 family is control-plane throttling — scale/create operations fired faster than the subscription allows, common in automation looping over many databases. Fix by serializing and backing off those az/ARM calls, not by touching the database.

DTU vs vCore — which metric to watch

In the DTU model the single composite metric is DTU percentage — pinned at 100% means throttled, but you can’t tell whether CPU, IO or log was the bottleneck. The vCore model gives separate metrics — CPU percent, Data IO percent, Log IO percent, Workers percent — far more diagnosable, one of the strongest operational arguments for vCore at scale: you see which resource is the ceiling.

The two models, side by side on the dimensions that matter when you’re throttled:

Dimension DTU model vCore model
Resource exposed One blended number (DTU%) Separate CPU / Data IO / Log IO / Workers %
Diagnosability when throttled Poor — can’t tell which resource High — the saturated metric is obvious
Tiers Basic, Standard (S0–S12), Premium (P1–P15) General Purpose, Business Critical, Hyperscale
Scaling granularity Service objective steps Per-vCore, plus tier
Storage scaling Coupled to the objective Independent (esp. Hyperscale)
Best for Small/steady, simple cost Production at scale, tunable, diagnosable
Reserved-capacity discount Limited Yes (1/3-year)

The most commonly-missed governor is log rate. Azure SQL caps the transaction-log generation rate per tier (a few MB/s on small tiers, scaling with size). A bulk insert, index rebuild or chatty write workload saturates Log IO percent at 100% while CPU sits at 20%, and every write waits on LOG_RATE_GOVERNOR/INSTANCE_LOG_GOVERNOR. Engineers see low CPU and conclude “not busy,” when it’s actually log-bound.

Real limit numbers, so you know which ceiling you’re near. These are representative figures (they scale with the exact objective; treat them as the order of magnitude, and read live values from sys.dm_user_db_resource_governance):

Service objective Model Max vCores / DTUs Max concurrent workers Max concurrent sessions Approx log rate Max data size
Basic DTU 5 DTU ~30 ~300 ~0.6 MB/s 2 GB
S0 DTU 10 DTU ~60 ~600 ~0.6 MB/s 250 GB
S2 DTU 50 DTU ~120 ~1,200 ~2.5 MB/s 250 GB
S3 DTU 100 DTU ~200 ~2,400 ~5 MB/s 1 TB
S7 DTU 800 DTU ~1,600 ~19,200 ~20 MB/s 1 TB
P1 DTU 125 DTU ~200 ~30,000 ~12 MB/s 1 TB
P6 DTU 1,000 DTU ~1,600 ~30,000 ~50 MB/s 1 TB
GP_Gen5_2 vCore 2 vCore ~200 ~30,000 ~9 MB/s 1 TB
GP_Gen5_8 vCore 8 vCore ~800 ~30,000 ~36 MB/s 1 TB
BC_Gen5_8 vCore 8 vCore ~800 ~30,000 ~96 MB/s 1 TB+
Hyperscale Gen5_8 vCore 8 vCore ~800 ~30,000 ~100 MB/s (sustained) up to 100 TB

The pattern to remember: workers ≈ 75 × vCores (and step up with DTU tier), sessions cap around 30,000 on most paid tiers but far lower on Basic/Standard-low, and log rate scales with compute size up to a per-tier ceiling (Business Critical roughly doubles General Purpose). When you hit a wall, the number you crossed tells you exactly which to raise.

Confirm (KQL, Azure Monitor / Log Analytics). Find the saturated dimension:

// AzureMetrics: peak utilisation per metric over the last hour for one database
AzureMetrics
| where ResourceProvider == "MICROSOFT.SQL"
| where TimeGenerated > ago(1h)
| where MetricName in ("cpu_percent","log_write_percent","physical_data_read_percent",
                       "workers_percent","sessions_percent","dtu_consumption_percent")
| summarize Max = max(Maximum), Avg = avg(Average) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

The metric names you alert on, and what crossing each one means:

Metric (Azure Monitor) DMV equivalent Saturated → Alert threshold I use The fix it points to
cpu_percent avg_cpu_percent CPU-bound (plans, scans, compute) > 80% for 5 min Tune query / add index / scale CPU
log_write_percent avg_log_write_percent Log-rate governor > 90% for 5 min Batch writes; maintenance window; scale
physical_data_read_percent avg_data_io_percent Data IO (cold reads, missing index) > 80% for 10 min Index; more memory (bigger tier)
workers_percent max_worker_percent Worker exhaustion → 10928 ID 1 > 80% Break blocking; fix pool; scale
sessions_percent max_session_percent Session cap → 10928 ID 2 > 80% Fix pool leak; scale
dtu_consumption_percent (DTU only) Some resource (unknown which) > 85% Move to vCore to see which
tempdb_data_size / spills (plan warnings) tempdb → 40551 trend-based Reduce sort/hash; paginate
storage_percent SUM(size) vs max Size quota → 40544 > 85% Archive; raise maxSizeBytes

Confirm (T-SQL, live). sys.dm_db_resource_stats samples every ~15 seconds for the last hour — the most precise in-database governance view:

-- Last hour, 15-second granularity: which governor is hitting 100%?
SELECT TOP 50
    end_time,
    avg_cpu_percent,
    avg_data_io_percent,
    avg_log_write_percent,          -- log rate governor
    max_worker_percent,             -- worker thread pressure -> 10928
    max_session_percent             -- session cap -> 10928
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

For a longer window (14 days, 5-minute granularity) query sys.resource_stats from master. To see what is waiting on a governor right now:

-- Requests currently blocked on resource governors
SELECT session_id, wait_type, wait_time, command, status
FROM sys.dm_exec_requests
WHERE wait_type IN ('LOG_RATE_GOVERNOR','INSTANCE_LOG_GOVERNOR',
                    'SE_REPL_SLOW_SECONDARY_THROTTLE','HADR_THROTTLE_LOG_RATE_SEND_RECV')
ORDER BY wait_time DESC;

The governance DMVs, ranked by how you use them in an incident:

DMV Window / grain Scope Best for Note
sys.dm_db_resource_stats Last 1 hour, ~15 s Current DB “Which governor is at 100% right now” Most precise; your go-to
sys.resource_stats (in master) 14 days, 5 min All DBs on server Trend / capacity planning Coarser; historical
sys.dm_user_db_resource_governance Point-in-time Current DB The actual limit numbers for this objective Confirms caps before scaling
sys.dm_exec_requests (wait_type filter) Live Current DB “Who is waiting on a governor now” Catches LOG_RATE_GOVERNOR live
sys.dm_os_performance_counters Live counters Engine Log flushes, batch req/s Advanced corroboration

Fix, in order: (1) reduce demand — fix the query generating excessive log/IO (batch large deletes, add the missing index so a scan becomes a seek, avoid tempdb spills); (2) scale the tier (more DTUs/vCores raises CPU, log rate and workers together); (3) for worker/session exhaustion, fix pooling. Scaling is a control-plane op:

# Scale a single database up a service objective (raises CPU, log rate, workers together)
az sql db update -g rg-data -s myserver -n orders --service-objective S3      # DTU model
az sql db update -g rg-data -s myserver -n orders \
  --edition GeneralPurpose --family Gen5 --capacity 4                          # vCore: 4 vCores
// vCore General Purpose, 4 vCores, as code
resource db 'Microsoft.Sql/servers/databases@2023-08-01-preview' = {
  parent: sqlServer
  name: 'orders'
  location: location
  sku: { name: 'GP_Gen5', tier: 'GeneralPurpose', family: 'Gen5', capacity: 4 }
  properties: { maxSizeBytes: 268435456000 }   // 250 GB
}

A reduce-demand-vs-scale decision table — because scaling a query bug just makes you pay more to fail later:

Symptom Reduce demand first (cheap) Scale (costs money) When scaling is the right answer
Log IO 100%, CPU low Batch writes; maintenance window; drop redundant indexes Bigger compute raises log rate Legitimately high steady write rate
CPU 100%, IO low Tune the top query (Query Store); add index More vCores/DTUs Genuine compute demand at peak
Data IO 100% Add covering index; reduce scans Bigger tier = more memory to cache Working set legitimately exceeds memory
Workers 100% (10928 ID 1) Break blocking; cut active request count Higher tier raises worker cap Real concurrency need, no blocking
Sessions 100% (10928 ID 2) Fix pool leak; lower Max Pool Size churn Higher tier raises session cap Genuinely many clients
tempdb (40551) Index away sorts; paginate; smaller batches Bigger tier = more tempdb Inherently large aggregations

Worker threads, not connections, are the hidden ceiling. Each active request consumes a worker; the cap scales with tier (low hundreds on small tiers, thousands on large). A pool of 200 firing at once needs 200 workers; add blocking and they pile up waiting, you hit the cap, and new requests fail with 10928 Resource ID: 1. Throttling and blocking are often the same incident.

Blocking, locking, deadlocks & waits

CPU is at 30%, log rate is fine, the network is clean — and the app is dead. This is almost always blocking: one session holds a lock a chain of others wait for. It’s the most lucrative diagnostic skill here — invisible to resource metrics, and the platform won’t fix it for you.

The wait types you must recognise

Every stalled request has a wait_type that names exactly what it is waiting on. Memorise this table — it is the fastest triage in SQL Server, Azure or on-prem:

Wait type What the request is waiting on Points at First action
LCK_M_S A shared lock (a read blocked by a writer) Reader blocked by a long write transaction RCSI; shorten the writer
LCK_M_U An update lock (taken before a modify) Update path contention Shorten transaction; index the predicate
LCK_M_X An exclusive lock (writer vs writer/reader) Write-write contention Find/kill root blocker; consistent order
LCK_M_IX / LCK_M_IS Intent locks (page/table level intent) Lock escalation under big modifies Batch writes; avoid escalation
PAGEIOLATCH_SH / PAGEIOLATCH_EX Reading a data page from storage IO-bound: missing index or too little memory Add index; bigger tier (more cache)
WRITELOG Flushing the transaction log to storage Log-bound (commit-heavy / small batches) Batch writes; check log rate
LOG_RATE_GOVERNOR The per-tier log-rate cap Throttling (governance), not IO Batch; maintenance window; scale
RESOURCE_SEMAPHORE A memory grant for a sort/hash Memory pressure → 40553 Cut grant: index, smaller batch
THREADPOOL A worker thread to even run Worker exhaustion → 10928 ID 1 Break blocking; fix pool; scale
SOS_SCHEDULER_YIELD A CPU scheduler quantum CPU pressure Tune query; scale CPU
ASYNC_NETWORK_IO The client consuming results Slow client / huge result set Page results; fix client consumption
PAGELATCH_* (non-IO) An in-memory page latch (e.g. tempdb hotspot) tempdb allocation contention / hot page Reduce tempdb churn; spread inserts

Reading the dominant wait is the 30-second diagnosis: LCK_M_* → blocking; PAGEIOLATCH_* → IO/missing index; WRITELOG/LOG_RATE_GOVERNOR → log bottleneck; RESOURCE_SEMAPHORE → memory grants; THREADPOOL → worker exhaustion; SOS_SCHEDULER_YIELD → CPU. Each row sends you to a different fix.

The DMV reference matrix

You diagnose Azure SQL from DMVs the way you once tailed errorlog. These are the views this section uses — what each exposes and the question it answers:

DMV Exposes The question it answers Key columns
sys.dm_exec_requests Every executing request right now “Who is running / waiting / blocking whom?” session_id, blocking_session_id, wait_type, wait_time, sql_handle
sys.dm_exec_sessions All sessions (active + sleeping) “Which login/host/program owns this SPID?” session_id, login_name, host_name, program_name, status
sys.dm_tran_locks Every lock held and requested “What lock mode is held vs waited, on which object?” request_session_id, resource_type, request_mode, request_status
sys.dm_os_wait_stats Aggregate waits since reset “What is the database waiting on overall?” wait_type, wait_time_ms, waiting_tasks_count
sys.dm_db_resource_stats Governance utilisation, last hour “Which resource am I throttled on?” avg_cpu_percent, avg_log_write_percent, max_worker_percent
sys.dm_exec_query_stats Aggregated stats per cached plan “Which query burns the most CPU/IO across executions?” total_worker_time, total_logical_reads, execution_count
sys.dm_exec_sql_text(handle) The SQL text for a handle “What is this SPID actually running?” text
sys.dm_exec_query_plan(handle) The cached plan XML “What plan is it using (scan vs seek)?” query_plan
sys.dm_exec_connections Transport-level connection info “Encrypt on? client net address? protocol?” encrypt_option, client_net_address, net_transport
sys.dm_xe_database_session_targets Extended Events target data “Capture deadlock graphs from system_health target_data

See the blocking chain right now

The fastest “who is blocking whom” query joins sys.dm_exec_requests to itself by blocking_session_id:

-- Live blocking chain: blocker -> victim, with the SQL text and wait type
SELECT
    r.session_id            AS victim_session,
    r.blocking_session_id   AS blocker_session,
    r.wait_type,                       -- e.g. LCK_M_X, LCK_M_U
    r.wait_time              AS wait_ms,
    r.command,
    r.status,
    DB_NAME(r.database_id)   AS db,
    SUBSTRING(t.text, (r.statement_start_offset/2)+1,
        ((CASE r.statement_end_offset WHEN -1 THEN DATALENGTH(t.text)
              ELSE r.statement_end_offset END - r.statement_start_offset)/2)+1) AS running_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;

The session at the head of the chain — blocking_session_id = 0, pointed to by everyone else — is the root blocker. The classic EXEC sp_who2 gives a quick human-readable view (its BlkBy column shows each victim’s blocking SPID), but the DMV query above is what you script. To see the actual locks held and requested (granularity and mode), use sys.dm_tran_locks:

-- What locks is the blocker holding vs the victim waiting for?
SELECT
    l.request_session_id,
    l.resource_type,
    l.resource_database_id,
    l.request_mode,            -- S, U, X, IX, etc.
    l.request_status,          -- GRANT vs WAIT
    OBJECT_NAME(p.object_id, l.resource_database_id) AS object_name
FROM sys.dm_tran_locks l
LEFT JOIN sys.partitions p ON l.resource_associated_entity_id = p.hobt_id
WHERE l.request_session_id IN (<blocker_spid>, <victim_spid>)
ORDER BY l.request_session_id, l.request_status;

The LCK_M_* wait types decode the lock mode: LCK_M_S (shared — a read waiting on a writer), LCK_M_U (update — taken before a modify), LCK_M_X (exclusive — writer vs writer/reader), LCK_M_IX/LCK_M_IS (intent locks). A chain saturated with LCK_M_X/LCK_M_U is write-write contention; lots of LCK_M_S is readers blocked by a long writer — the cue for snapshot isolation / RCSI.

The lock modes and their compatibility — why a reader and a writer collide but two readers don’t:

Lock mode Taken by Compatible with Blocks Typical wait seen
S (Shared) Reads other S, IS X, U (writers) LCK_M_S
U (Update) The “about to modify” scan S, IS U, X LCK_M_U
X (Exclusive) INSERT/UPDATE/DELETE nothing everything LCK_M_X
IS (Intent Shared) Higher-level intent to read S, IS, IX X LCK_M_IS
IX (Intent Exclusive) Higher-level intent to write IS, IX S, U, X LCK_M_IX
Sch-M (Schema Modify) DDL / index rebuild nothing all access LCK_M_SCH_M

Fix the immediate incident. Decide whether the root blocker is a legitimate long transaction (let it finish, or fix the app to commit sooner) or a stuck/abandoned one (open transaction from a crashed client), then kill it to release the chain:

KILL <root_blocker_session_id>;   -- releases its locks; the chain drains

Fix the root cause. Blocking is almost always: a transaction doing too much between BEGIN TRAN and COMMIT (especially waiting on app/user input while holding locks — the cardinal sin), a missing index forcing a scan that over-locks, lock escalation (row → table) under large modifications, or the wrong isolation level. Durable fixes: shorten transactions, add the covering index, batch large writes, and adopt Read Committed Snapshot Isolation so reads use row-versioning instead of shared locks:

-- RCSI: readers stop blocking writers (uses tempdb row-versioning)
ALTER DATABASE [orders] SET READ_COMMITTED_SNAPSHOT ON;

The root causes of blocking and their durable fixes, ordered by how often they’re the culprit:

Root cause Tell-tale Durable fix Trade-off
Transaction held across app/user/network I/O Long wait_time, blocker idle (sleeping, open tran) Commit before I/O; never BEGIN TRAN around a user prompt App refactor
Missing index → scan over-locks Plan shows a scan; many rows locked Add covering/seek index Index maintenance + storage
Lock escalation (row → table) LCK_M_IX/table-level X on big modifies Batch writes (< 5,000 rows); partition More batches, slight overhead
Wrong isolation (readers block writers) Lots of LCK_M_S RCSI / snapshot isolation tempdb row-version store usage
Long single statement (big delete/update) One statement, huge runtime Batch + WAITFOR DELAY More elapsed time, less contention
Abandoned/orphaned transaction Blocker status = sleeping, open tran, no command KILL; fix client to dispose Lost work in that transaction

Aggregate waits — what is the database waiting on overall?

When the question is “what is the general bottleneck,” sys.dm_os_wait_stats aggregates every wait since the last reset. Filter benign system waits and see what dominates:

-- Top real waits (exclude benign/idle waits), as a % of total
WITH waits AS (
  SELECT wait_type, wait_time_ms, waiting_tasks_count,
         100.0 * wait_time_ms / SUM(wait_time_ms) OVER() AS pct
  FROM sys.dm_os_wait_stats
  WHERE wait_type NOT IN (
    'CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK',
    'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE',
    'CHECKPOINT_QUEUE','REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT',
    'BROKER_TO_FLUSH','BROKER_TASK_STOP','CLR_MANUAL_EVENT','CLR_AUTO_EVENT',
    'DISPATCHER_QUEUE_SEMAPHORE','XE_DISPATCHER_WAIT','XE_DISPATCHER_JOIN',
    'SQLTRACE_INCREMENTAL_FLUSH_SLEEP','HADR_FILESTREAM_IOMGR_IOCOMPLETION')
)
SELECT TOP 20 wait_type, waiting_tasks_count, wait_time_ms, CAST(pct AS DECIMAL(5,2)) AS pct
FROM waits ORDER BY wait_time_ms DESC;

Reading the result: dominant LCK_M_* → blocking (back to the chain query). PAGEIOLATCH_* → reading pages from storage (IO-bound / missing index / too little memory to cache the working set). WRITELOG/LOG_RATE_GOVERNOR → log bottleneck (throttling). SOS_SCHEDULER_YIELD → CPU pressure. RESOURCE_SEMAPHORE → memory-grant starvation (big sorts/hashes queuing). THREADPOOL → worker exhaustion (10928). Each points at a different fix; this is the best 30-second triage in SQL.

Deadlocks — capturing the graph

A deadlock is mutual blocking the engine resolves by killing one session as the deadlock victim (error 1205: Transaction was deadlocked on lock resources and has been chosen as the deadlock victim. Rerun your transaction.). Unlike plain blocking, deadlocks auto-resolve — but they fail transactions and must be designed out. Azure SQL captures recent deadlocks in the system_health Extended Events ring buffer:

-- Extract recent deadlock graphs from the system_health XEvent session
SELECT
    xed.value('@timestamp','datetime2')               AS deadlock_time,
    xed.query('.')                                     AS deadlock_graph
FROM (
  SELECT CAST(target_data AS XML) AS tx
  FROM sys.dm_xe_database_session_targets t
  JOIN sys.dm_xe_database_sessions s ON s.address = t.event_session_address
  WHERE s.name = 'system_health' AND t.target_name = 'ring_buffer'
) AS x
CROSS APPLY tx.nodes('//RingBufferTarget/event[@name="xml_deadlock_report"]') AS q(xed)
ORDER BY deadlock_time DESC;

The graph XML shows both processes, the resources each held and requested, and the victim. Blocking and deadlock are different beasts and need different fixes — don’t conflate them:

Aspect Blocking Deadlock
Definition One session waits on another’s lock Two+ sessions wait on each other’s locks
Resolution Persists until the blocker releases Engine auto-kills a victim (1205)
Error surfaced None (just slow / timeout) 1205 to the victim
Detect with sys.dm_exec_requests (blocking_session_id) system_health deadlock graph
Immediate fix KILL the root blocker Victim auto-rolled back; retry it
Durable fix Shorten transactions; index; RCSI Consistent lock order; short trans; index
Retry-able? No (fix the cause) Yes — 1205 is transient

The fixes: access objects in a consistent order across transactions (the cure for the “A-then-B vs B-then-A” deadlock), keep transactions short, add indexes so fewer rows lock, and consider RCSI/snapshot for read-write deadlocks.

Query Store & Intelligent Insights — finding the regressed query

DMVs show now; Query Store shows over time, persisting query text, plans and runtime stats so you can prove “this regressed at 09:00 when the plan changed.” On by default for new databases. The most useful query is “top resource consumers in the last day”:

-- Top 20 queries by total duration over the last 24h, with plan count
SELECT TOP 20
    qt.query_sql_text,
    rs.count_executions,
    CAST(rs.avg_duration/1000.0 AS DECIMAL(10,2))    AS avg_ms,
    CAST(rs.avg_cpu_time/1000.0 AS DECIMAL(10,2))    AS avg_cpu_ms,
    rs.avg_logical_io_reads,
    COUNT(DISTINCT p.plan_id)                        AS plans
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p           ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q          ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt    ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_runtime_stats_interval i ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE i.start_time > DATEADD(HOUR, -24, SYSUTCDATETIME())
GROUP BY qt.query_sql_text, rs.count_executions, rs.avg_duration, rs.avg_cpu_time, rs.avg_logical_io_reads
ORDER BY rs.avg_duration * rs.count_executions DESC;

The Query Store metrics and what each surfaces — pick the dimension that matches your symptom:

Query Store metric Surfaces Use when Mitigation it leads to
avg_duration × count_executions The overall worst offender (total time) “What’s eating the database’s time?” Tune/index the top query
avg_cpu_time CPU-heavy queries CPU governor at 100% Reduce scans; better plan
avg_logical_io_reads Buffer-pool churn / missing index Data IO pressure Covering index
avg_tempdb_space_used Sort/hash spills tempdb (40551) Smaller grants; paginate
avg_rowcount Over-fetching / cartesian blow-ups Huge result sets Add predicates / TOP
count_compiles Recompile storms / no plan reuse Plan-cache thrash Parameterise; avoid RECOMPILE
Distinct plan_id count (plans > 1) Plan regression candidate “Fast yesterday, slow today” Force the good plan

A query with plans > 1 is a plan regression candidate — Query Store lets you force the previously-good plan (sys.sp_query_store_force_plan) without touching code, the fastest mitigation after a stats change or parameter-sniffing flip.

Intelligent Insights and the Performance overview (portal Intelligent Performance blade) are the managed layer: Azure continuously analyses the workload and raises diagnostics (“queries waiting on locks,” “tempdb contention,” “plan regression on query_id X”) plus missing-index and high-DTU recommendations. Database Advisor automatic tuning can even create/drop indexes and force plans. The recommendation types it raises, and what each costs you to accept:

Recommendation (Advisor / Insights) What it detects The action Risk of auto-applying
CreateIndex A missing index that would cut a scan Creates a nonclustered index Write overhead; storage
DropIndex Duplicate/unused index Drops it Could hurt an unseen query — review
ForceLastGoodPlan A regressed plan vs a prior good one Forces the good plan Stale if stats change again
“Queries waiting on locks” Blocking pattern (diagnostic only) None — it just flags
“tempdb contention” Sort/hash spills (diagnostic only) None
“High DTU consumers” Top resource queries (diagnostic only) None

Pull active recommendations:

# Performance recommendations Azure has generated for the database
az sql db advisor list -g rg-data -s myserver -d orders -o table
# Drill into one (e.g. createIndex) to see the recommended action
az sql db advisor show  -g rg-data -s myserver -d orders -n createIndex

Client-side: pooling, retries, timeouts

A large share of “Azure SQL is unreliable” tickets are client-side. Three issues dominate.

Connection-pool exhaustion. Drivers (ADO.NET/Microsoft.Data.SqlClient, JDBC) keep a pool per unique connection string (default max pool size 100 in ADO.NET). If code leaks connections, holds them across slow external calls, or blocking makes each live longer, the pool drains. The error is unmistakable: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool ... all pooled connections were in use and max pool size was reached. This is a client pool timeout, not a database timeout — the database may be idle. Fixes: dispose connections (using/try-with-resources), don’t hold them across I/O, fix upstream blocking (a frequent cause of pool exhaustion), and only then raise Max Pool Size.

Transient-fault retry with backoff. Because of reconfigurations, your app must retry the transient set and must not blindly retry persistent errors. Microsoft.Data.SqlClient ships configurable retry logic (SqlConnection.RetryLogicProvider); EF Core has EnableRetryOnFailure(). The shape is exponential backoff with jitter, capped attempts — never a tight loop, which amplifies a reconfiguration into a thundering herd that delays recovery.

The retry/no-retry decision is binary and must be encoded in the app. Get this list right and “flaky DB” tickets disappear:

Error Retry? Why Backoff guidance
40613, 40197, 40143 Yes Reconfiguration / failover (seconds) Exp. backoff + jitter, 3–5 attempts
10928, 10929 Yes Transient governance / server busy Backoff; if persistent, scale
10053, 10054, 10060 Yes Transport drop / connect timeout Backoff; also check the redirect trap
233 Yes (then fix TLS) Handshake blip / TLS Retry once; if persistent fix Encrypt/driver
1205 Yes Deadlock victim — re-runnable Immediate retry (small jitter)
40615, 40914 No Firewall — fails identically Fix the rule; do not loop
18456, 40532 No Auth — fails identically Fix credential/identity
40544 No Size quota — fails identically Free space / scale storage
49918/49919/49920 No (in the DB path) Control-plane — fix the pipeline Serialise management calls

The retry settings to configure, and what each controls:

Setting Where Typical value What it controls
NumberOfTries / max attempts SqlRetryLogicOption / EF maxRetryCount 5 How many total tries
DeltaTime / base delay SqlRetryLogicOption / EF maxRetryDelay 1–5 s Base of the exponential backoff
MaxTimeInterval SqlRetryLogicOption 30 s Cap on a single backoff
Jitter Built into provider on Spreads retries to avoid a herd
TransientErrors list SqlRetryLogicOption the set above Which codes are retried
EF errorNumbersToAdd EnableRetryOnFailure(...) extras Add codes EF doesn’t retry by default

Connection timeout vs command timeout — conflated constantly. The connection timeout (ADO.NET Connect Timeout, default 15s; JDBC loginTimeout) bounds establishing the connection (DNS+TCP+TLS+login); hitting it means a Hops 1–4 (network/auth) problem, and raising it just delays the error. The command timeout (SqlCommand.CommandTimeout, default 30s) bounds a single query; hitting it means the query is slow (blocking, missing index, parameter sniffing) — an engine/query fix, not network. The classic mistake is bumping the connection timeout to “fix” slow queries (does nothing) or the command timeout to mask blocking (hides it until worse). Diagnose which timeout fired, then fix that layer.

The full client timeout/pool settings reference — these four values cause most “timeout” confusion:

Setting ADO.NET keyword Default Bounds Hitting it means Don’t use it to…
Connection timeout Connect Timeout / Connection Timeout 15 s DNS + TCP + TLS + login Hops 1–4 problem (network/auth) …mask a slow query (it won’t)
Command timeout Command Timeout (or CommandTimeout) 30 s One query’s execution Slow query (blocking/index/plan) …mask blocking (it hides it worse)
Max pool size Max Pool Size 100 Concurrent pooled connections Pool exhaustion (“from the pool”) …paper over leaked connections
Min pool size Min Pool Size 0 Warm connections kept open (perf) avoids cold connect spikes
Pool lifetime Load Balance Timeout / Connection Lifetime 0 (off) Max connection age Rebalancing after failover
Encrypt Encrypt True (modern drivers) TLS on the wire 233 if mismatched
Multiple Active Result Sets MultipleActiveResultSets False Interleaved result sets Pool pressure if misused
# Quick connectivity isolation test with sqlcmd, with explicit short login timeout (-l)
# If THIS fails fast, it's hops 1-4 (network/firewall/auth), not a slow query.
sqlcmd -S tcp:myserver.database.windows.net,1433 -d orders \
  -G -l 15 -Q "SELECT 1"      # -G = Entra auth; -l 15 = 15s login timeout

The diagnostic toolkit: which tool answers which question

Knowing where to look is half the battle. Before the deep work, the tools matrix — what each shows, how to reach it, the plane it lives in, and what it’s best for:

Tool What it shows Plane How to access Best for
nslookup / Resolve-DnsName Name → IP (public vs private) Network Any client Private Link DNS (Hop 1)
nc -vz / Test-NetConnection Port reachability Network VNet client The 11000–11999 redirect trap
az sql server firewall-rule list Server firewall rules Control CLI / Cloud Shell 40615 (Hop 2)
az sql server conn-policy show Redirect/Proxy/Default Control CLI The connection-policy hop (Hop 3)
sqlcmd -l 15 -Q "SELECT 1" Fast connect isolation Data CLI Network/auth vs slow query
sys.dm_db_resource_stats Governance utilisation (1 h) Data T-SQL Throttling (which governor)
sys.dm_exec_requests Live requests + blocking Data T-SQL Blocking chains, live waits
sys.dm_os_wait_stats Aggregate waits Data T-SQL “What’s the bottleneck overall?”
sys.dm_tran_locks Locks held/requested Data T-SQL Lock mode/object in a block
system_health ring buffer Deadlock graphs Data T-SQL 1205 deadlocks
Query Store views Query/plan history Data T-SQL / portal Plan regression, top consumers
AzureMetrics (KQL) Metric trends Telemetry Log Analytics Alerts on log/workers/CPU
Intelligent Insights / Performance overview Managed diagnostics Telemetry Portal Auto-found regressions, missing indexes
az sql db advisor list Tuning recommendations Control CLI / portal Index + plan recommendations
Microsoft Defender for SQL Threat alerts Security Portal Injection, anomalous logins

The same tools, mapped to the symptom they crack — start from your symptom, not from the tool:

If your symptom is… Reach for Because it shows
“Can’t connect, only from the VNet” nslookup + nc -vz DNS target + which ports actually open
“40615 in the logs” az sql server firewall-rule list Whether the egress IP is allowed
“Connects then hangs from a spoke” az sql server conn-policy show + nc 11000 Redirect policy + blocked high ports
“Login failed” sys.database_principals (right DB) Whether the principal exists where you connect
“Timeouts under load” sys.dm_db_resource_stats Which governor is at 100%
“App halts, CPU fine” blocking-chain DMV + sys.dm_tran_locks The root blocker and the lock mode
“Fast yesterday, slow today” Query Store (plans > 1) A plan regression to force away
“Random 1205s” system_health deadlock graph The two processes and lock order
“Timeout from the pool” App pool config + upstream blocking A client-side, not database, problem

Architecture at a glance

The diagram lays out the connection path as a horizontal pipeline — the order a request travels and the order you diagnose in. Left to right across the top: the Client / App (retry logic, pool size) reaches DNS to resolve *.database.windows.net, passes the Firewall (server rules, VNet rules, Private Endpoint), hits the Gateway where the redirect vs proxy decision is made, lands on the Database (login, then DTU/vCore limits), and runs a Query that may stall on waits and blocking. Cyan arrows trace the happy path; dashed red arrows drop from each hop to the failure it produces.

The bottom row is the troubleshooting key: transient → retry; wrong CNAME / Private DNS → DNS hop; 40615 IP not allowed → firewall hop; port 11000–11999 blocked → connection-policy hop (the on-prem/peered-VNet trap); 10928/10929 → governance throttling; LCK_M_X / sp_who2 → blocking. The caption is the whole discipline — walk the path left to right and the first failing layer owns the incident — so once DNS resolves and the firewall lets you through, you never look at them again for that incident.

Azure SQL Database connection path shown as a left-to-right pipeline: Client/App with retry logic and pool size, then DNS resolving star-dot-database-dot-windows-dot-net, then Firewall (server plus VNet rules or Private Endpoint), then Gateway making the redirect-versus-proxy decision, then Database (login and DTU/vCore limits), then Query (waits and blocking). Cyan arrows show the request flow; dashed red arrows drop from each hop to its failure mode below — transient/retry, wrong CNAME or Private DNS, error 40615 IP not allowed, port 11000-11999 blocked, 10928/10929 throttle, and LCK_M_X with sp_who2 blocking — under a caption reading: walk the path left to right, the first failing layer owns the incident

To anchor the diagram, here is each hop with the exact instrument and the single most common failure it produces — read it alongside the picture:

Pipeline node Owns these errors One instrument Most common single failure
Client / App “from the pool”, no-retry flakiness App config Per-request connection, no retry
DNS resolves to public IP nslookup privatelink zone not linked
Firewall 40615, 40914 az sql server firewall-rule list NAT egress IP not allowed
Gateway / policy 10060, 10054 conn-policy show + nc 11000 Redirect + blocked 11000–11999
Database (login) 18456, 40532, 233 sys.database_principals User exists only in user DB, connected master
Database (governance) 10928/10929/4055x/40613 sys.dm_db_resource_stats Log-rate or worker exhaustion
Query (locks) blocking, 1205 blocking-chain DMV A long writer holding LCK_M_X

Real-world scenario

Northwind Logistics runs a parcel-tracking platform on Azure. Its core database tracking sits on a single logical server nwl-sql-prod.database.windows.net — a vCore General Purpose, Gen5, 8 vCores instance holding ~600 GB. An App Service web tier and a VM Scale Set of scanner-ingestion workers (in vnet-app) both hit it. Average load is comfortable (CPU ~35%); they process ~9 million scan events/day, bursting to ~600/second during the afternoon delivery wave.

The incident began at 16:05 on a Friday during peak. The ingestion API started throwing 40613 Database not currently available and Timeout expired ... obtaining a connection from the pool in roughly equal measure; the customer website threw 40615 for a subset of users behind one corporate proxy. The on-call engineer’s first instinct — “the database is down” — was wrong: the Performance overview showed it online, and four internal dashboards were fine.

They walked the path. DNS resolved correctly to the private endpoint (Private Link added two weeks earlier). Firewall: the 40615s were a red herring from a separate change — a corporate proxy’s egress IP had rotated out of the server firewall, affecting only a slice of website users, unrelated to ingestion. The real incident was at the query layer. The blocking-chain DMV query revealed one root blocker: a nightly archival job triggered early, running an un-batched DELETE FROM scan_events WHERE created < @cutoff — ~40 million rows in one transaction. It escalated to table locks and saturated the log rate (Log IO 100%, CPU 38%). Ingestion writes piled up behind LCK_M_X and LOG_RATE_GOVERNOR waits; each held a worker and a pooled connection; within four minutes the pool exhausted and worker count neared the cap, producing 10928 and 40613 as the engine shed load.

Resolution took eleven minutes once they saw the chain. They KILLed the runaway delete; locks released, the chain drained, the pool refilled, and ingestion recovered in ~90 seconds. They added the rotated proxy IP to clear the website 40615s. Durable fixes: rewrote the delete to batch in 5,000-row chunks with WAITFOR DELAY to stay under the log governor; added a non-clustered index on created; added retry-with-backoff for the transient set; restricted the archival job to a maintenance window; and added Azure Monitor alerts on log_write_percent > 90% for 5 minutes and on blocking. Net effect: an incident that took 3+ hours the first time now self-alerts and is a sub-15-minute fix.

The incident as a timeline, because the order of the symptoms is itself the lesson — one root cause cascaded into four different error codes:

Time What surfaced What the team thought What it actually was Right move
16:05 40613 + pool timeouts on ingestion “Database is down” Engine shedding load Check Performance overview (it was online)
16:06 40615 for some website users “Firewall broke too” Unrelated rotated proxy IP (red herring) Park it; chase the main incident
16:08 Ran the blocking-chain DMV One root blocker: a 40M-row delete Found the chain head
16:09 LOG_RATE_GOVERNOR + LCK_M_X waits “Network/throttle?” Un-batched delete saturating log + locking Confirm via dm_db_resource_stats
16:11 10928 Resource ID 1 appearing “Scale up now” Workers piling behind the blocker Kill the blocker, don’t scale
16:14 KILL the delete Chain drained Ingestion recovered in ~90 s
+1 week Durable fixes shipped Batched delete, index, retries, alerts 3-hour incident → sub-15-min

Advantages and disadvantages

The trade-off here is the managed troubleshooting model — diagnosing Azure SQL from codes/DMVs/metrics versus the old on-prem “RDP in and look at the box.”

Advantages of the Azure SQL diagnostic model Disadvantages / friction
Rich, precise error numbers tell you the exact resource/limit you hit (10928 → workers, 40551 → tempdb) No host access — you cannot restart a service, tail errorlog on disk, or run host-level tools; you diagnose only through the data plane and metrics
DMVs, Query Store and system_health XEvents are always on and queryable live DMV scope is per-database in Azure SQL; some server-wide DMVs/instance-level features behave differently than boxed SQL Server
The platform handles HA/failover for you (reconfigurations) — fewer real outages Those same reconfigurations cause transient errors you must code retry logic for; ignore them and the app looks flaky
Resource governance prevents one workload from taking the server down hard Governance also throttles you at the tier’s limits (log rate especially) — you hit ceilings you didn’t choose and must scale or optimise
Intelligent Insights / Automatic Tuning find regressions and missing indexes automatically Auto-tuning can surprise you (a forced plan or auto-created index) if you don’t watch its actions
Private Link + firewall + VNet rules give strong network isolation The same controls add failure modes (DNS, the redirect/proxy port trap) that don’t exist for a simple LAN connection to a local server

The advantages dominate for teams that want uptime without running infrastructure — the diagnostics are more precise than grovelling through host logs. The disadvantages bite hardest on migrations (on-prem assumptions about always-available connections, 1433-only firewalls and host access carry over and break) and cost-constrained tiers (governance limits — especially log rate and workers — hit far sooner than teams expect). Concretely, which workloads feel which edge of the trade-off:

Workload Feels the advantage as… Feels the disadvantage as…
Greenfield cloud-native app Precise codes + retries = robust by design Must learn the redirect/proxy trap once
On-prem migration (“lift & shift”) Managed HA/backups, no patching 1433-only firewalls + always-on assumptions break
Cost-sensitive small tier Cheap, serverless auto-pause Log-rate/worker governors hit early
Large analytical/batch Hyperscale + separate metrics tempdb/memory/log governors on big jobs
High-concurrency OLTP Governance prevents a meltdown Worker cap + blocking can cascade fast
Strict-compliance data Private Link + Defender + auditing DNS/private-path failure modes to manage

The redirect/proxy port trap and the mandatory retry logic are the two things every team underestimates.

Hands-on lab

This lab provisions a small Azure SQL Database, then deliberately triggers and diagnoses two failure families: a firewall block (40615) and a blocking chain. It uses General Purpose serverless so it auto-pauses and costs almost nothing. Run az in Cloud Shell and T-SQL in the portal Query editor or sqlcmd. The phases at a glance:

Step Goal Key command / query Expected outcome
1 Provision serverless DB az sql db create ... --compute-model Serverless Server + DB Succeeded
2 Reproduce + fix 40615 sqlcmd ... SELECT 1 → add firewall rule 40615, then 1
3 Inspect connection policy az sql server conn-policy show Default
4 Create a blocking chain two sessions, one uncommitted UPDATE Session 2 hangs
5 Diagnose + resolve blocking-chain DMV → KILL Chain drains
6 Watch governance sys.dm_db_resource_stats Live CPU/log/worker %
7 Teardown az group delete Resources removed

Step 1 — Create a server and a tiny serverless database.

RG=rg-sql-lab; LOC=eastus; SRV=nwl-sqllab-$RANDOM; DB=lab
az group create -n $RG -l $LOC -o table
az sql server create -g $RG -n $SRV -l $LOC \
  --admin-user sqladminuser --admin-password 'P@ssw0rd-$(openssl rand -hex 4)!'
# Serverless GP, auto-pause after 60 min idle, 1 min vCore -> near-zero cost
az sql db create -g $RG -s $SRV -n $DB \
  --edition GeneralPurpose --family Gen5 --capacity 1 --compute-model Serverless \
  --auto-pause-delay 60 -o table

Server and database both report Succeeded. The FQDN is $SRV.database.windows.net.

Step 2 — Reproduce a firewall block (40615), then fix it. Connect without a firewall rule:

sqlcmd -S tcp:$SRV.database.windows.net,1433 -d $DB \
  -U sqladminuser -P '<the password you set>' -Q "SELECT 1"
# Expected: Msg 40615 — Cannot open server ... Client with IP address 'x.x.x.x' is not allowed.

Read the IP from the error, add it, and reconnect:

MYIP=$(curl -s ifconfig.me)
az sql server firewall-rule create -g $RG -s $SRV -n cloudshell \
  --start-ip-address $MYIP --end-ip-address $MYIP
sqlcmd -S tcp:$SRV.database.windows.net,1433 -d $DB \
  -U sqladminuser -P '<password>' -Q "SELECT 1"   # now returns 1

Step 3 — Inspect the connection policy (the redirect/proxy hop).

az sql server conn-policy show -g $RG -s $SRV --query "connectionType" -o tsv   # Default

Step 4 — Create a table and trigger a blocking chain. In two separate query sessions:

-- Setup (session 1)
CREATE TABLE dbo.accounts(id INT PRIMARY KEY, balance INT);
INSERT dbo.accounts VALUES (1,100),(2,100);

-- Session 1: open a transaction and DON'T commit (the blocker)
BEGIN TRAN;
UPDATE dbo.accounts SET balance = balance - 10 WHERE id = 1;
-- leave this session open, do not commit
-- Session 2: try to update the same row -> it BLOCKS on session 1
UPDATE dbo.accounts SET balance = balance + 10 WHERE id = 1;   -- hangs

Step 5 — Diagnose and resolve the block (a third session).

-- Find the chain: victim, blocker, wait type
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests WHERE blocking_session_id <> 0;
-- Expected: session 2 blocked by session 1 with wait_type LCK_M_U / LCK_M_X

-- Resolve: either COMMIT in session 1, or kill the blocker:
-- KILL <session_1_id>;

After committing/killing session 1, session 2 completes instantly — you’ve produced and read a real blocking chain.

Step 6 — Watch resource governance.

SELECT TOP 5 end_time, avg_cpu_percent, avg_log_write_percent, max_worker_percent
FROM sys.dm_db_resource_stats ORDER BY end_time DESC;

Validation checklist. You triggered and fixed a 40615, confirmed the connection policy, produced and resolved a blocking chain via the DMV workflow, and viewed live governance metrics — the full diagnostic loop end to end.

Teardown (do this — serverless still carries a tiny storage cost).

az group delete -n $RG --yes --no-wait

Cost note. Serverless GP with a 1-vCore minimum and auto-pause bills only for the compute-seconds used plus a few rupees of storage; deleting the resource group removes everything. Net cost: a few rupees.

Common mistakes & troubleshooting

The playbook. Scan the master table first to localise the incident to a row, then read the matching detail entry below for the exact commands. Each detail entry follows the same shape: symptom → root cause → confirm (exact command/path) → fix.

# Symptom Tell-tale signal Confirm (exact cmd / portal) Fix Band-aid that masks it
1 Works from Cloud Shell, times out from peered VNet Connects then hangs only from the VNet conn-policy show = Redirect; nc 11000 times out Force Proxy or open 11000–11999 to Sql tag Raising the command timeout
2 40615 IP not allowed The egress IP printed in the error az sql server firewall-rule list lacks it Add the egress IP, or VNet rule / Private Link “Allow Azure services” (0.0.0.0) — too broad
3 Private Link added, now VNet times out nslookup returns a public IP az network private-dns link vnet list shows no link Link the zone; fix DNS forwarder; set Proxy Re-enabling public access
4 18456 Login failed Same login fails identically every try sys.database_principals in the target DB Connect to right DB; create/map the principal Retrying (it’s persistent)
5 40613 not currently available Clears in <1 min with a metric gap If persists, dm_db_resource_stats pinned Retry w/ backoff; if persistent, scale/fix load Restart loop in the app
6 10928 Resource ID: 1 max_worker_percent near 100% Blocking-chain DMV + dm_db_resource_stats Break blocking; fix pool; scale up Scaling up while a blocker runs
7 Writes crawl, CPU low avg_log_write_percent = 100%, low CPU dm_exec_requests shows LOG_RATE_GOVERNOR Batch writes; maintenance window; scale Bigger command timeout
8 Timeout ... from the pool The phrase “from the pool” DB metrics healthy; app missing using/dispose Dispose; don’t hold across I/O; fix blocking Raising Max Pool Size first
9 Fast yesterday, slow today No code change; latency jump Query Store top-consumers shows plans > 1 Force the good plan; update stats; index OPTION(RECOMPILE) everywhere
10 1205 deadlock victim Periodic 1205s under load system_health deadlock graph Consistent lock order; short trans; index; retry Just retrying without fixing order
11 40544 size quota reached Writes fail; storage at max SUM(size)*8/1024 vs tier max; Overview blade Archive/delete; raise maxSizeBytes/scale Retrying (it’s persistent)
12 49918/49919/49920 too many ops The failing call is az/ARM, not a query Message names the subscription Serialise + back off management calls Touching the database
13 233 pre-login handshake Fails before auth; old client --query minimalTlsVersion = 1.2 Encrypt=True; upgrade driver; enable TLS 1.2 Lowering the TLS floor
14 40551 / 40553 tempdb / memory Session killed mid-aggregate RESOURCE_SEMAPHORE waits; plan spills Index away sorts; paginate; cap grant Scaling without fixing the query

1. “Works from Cloud Shell / on-prem, times out from the peered VNet.” Cause: connection policy is Redirect (the in-Azure default) and the subnet reaches the gateway on 1433 but not 11000–11999 (blocked by a hub firewall, NSG, or forced-tunnel UDR) — handshake succeeds, redirect fails. Confirm: az sql server conn-policy show -g rg -s srv --query connectionType returns Default/Redirect; from the VNet host nc -vz srv.database.windows.net 1433 works but nc -vz srv.database.windows.net 11000 times out. Fix: force Proxy (az sql server conn-policy update --connection-type Proxy) or allow 1433 11000-11999 outbound to the Sql service tag.

2. 40615 — Client with IP address 'x.x.x.x' is not allowed. Cause: the public egress IP the gateway saw (NAT/proxy IP for VNet traffic, not the private IP) isn’t in any server- or database-level firewall rule. Confirm: the IP is in the error text; az sql server firewall-rule list -g rg -s srv -o table and SELECT * FROM sys.database_firewall_rules don’t contain it. Fix: add the egress IP as a rule, or use a VNet rule/Private Link to allow the subnet instead of chasing rotating NAT IPs.

3. Private Link added, now it times out from inside the VNet. Cause: DNS still resolves to the public gateway IP — the privatelink.database.windows.net zone isn’t linked to the VNet (or custom DNS doesn’t forward to 168.63.129.16), or Redirect is fighting the private path. Confirm: nslookup srv.database.windows.net from the VNet returns a public IP; az network private-dns link vnet list -g rg -z privatelink.database.windows.net shows no link for that VNet. Fix: link the zone to the VNet, ensure the private-endpoint A record exists, forward custom DNS to Azure DNS, and set connection policy to Proxy.

4. 18456 — Login failed for user. Cause: wrong password; or a contained user that exists only in the user DB but you connected to master; or an Entra token for the wrong tenant/audience. Confirm: SELECT name FROM sys.database_principals in the target database; az sql server ad-admin list -g rg -s srv for token auth; check you’re on the right database, not master. The state number in the message narrows it (state 8 = bad password, 38/40 = wrong/no DB, 102–111 = Entra token). Fix: connect to the correct database; create the contained user / map the Entra principal; for SQL auth use a user/login that actually exists.

5. 40613 — Database '<db>' is not currently available. Cause: usually transient — a reconfiguration/failover (seconds) — or the database is so overloaded it’s being recycled. Confirm: clears within a minute with a brief metric gap → reconfiguration; persists → check sys.dm_db_resource_stats for pinned CPU/log/workers (overload). Fix: retry with exponential backoff; if persistent, treat as overload — scale or fix the saturating workload.

6. 10928 — Resource ID: 1 ... limit for the database is N and has been reached. Cause: worker-thread (Resource ID 1) or session (Resource ID 2) exhaustion — too many concurrent active requests, often because of blocking making each live longer, or a pool storm. Confirm: SELECT max_worker_percent, max_session_percent FROM sys.dm_db_resource_stats ORDER BY end_time DESC near 100%; run the blocking-chain query — a long blocker is frequently the cause. Fix: break the blocking chain (KILL root blocker / shorten transactions), fix pooling, and/or scale up (raises worker/session caps).

7. Writes crawl but CPU is low; WRITELOG / LOG_RATE_GOVERNOR waits dominate. Cause: the log generation rate is governed at the tier cap — a bulk insert, big un-batched delete or index rebuild saturates Log IO percent while CPU idles. Confirm: SELECT avg_log_write_percent FROM sys.dm_db_resource_stats pinned at 100% with low avg_cpu_percent; sys.dm_exec_requests shows LOG_RATE_GOVERNOR. Fix: batch large writes with small delays, add indexes so modifies touch fewer rows, do maintenance in a window, or scale (log rate scales with size).

8. App throws Timeout expired ... obtaining a connection from the pool. Cause: client pool exhaustion (default max 100 in ADO.NET) — leaked connections, held across slow calls, or blocking making them live longer. A client timeout; the database may be idle. Confirm: the error says “from the pool”; sys.dm_db_resource_stats looks healthy; check app code for missing using/dispose. Fix: dispose promptly, don’t hold across I/O, fix upstream blocking, and only then raise Max Pool Size.

9. A query fast yesterday is suddenly slow (no code change). Cause: plan regression — a stats update or parameter-sniffing flip gave a worse plan; or a missing/changed index. Confirm: the Query Store top-consumers query shows plans > 1; compare in the portal Query Store → Regressed Queries. Fix: force the good plan (sys.sp_query_store_force_plan), update stats, or add the covering index; consider OPTIMIZE FOR/RECOMPILE for sniffing-sensitive queries.

10. 1205 — chosen as the deadlock victim. Cause: two+ transactions acquired locks in opposite order and deadlocked; the engine killed the cheaper one. Confirm: extract the deadlock graph from system_health (the XEvent ring-buffer query above) to see both processes, resources and the victim. Fix: access objects in a consistent order, keep transactions short, add indexes to reduce locked rows, adopt RCSI/snapshot for read-write deadlocks, and retry 1205 (transient).

11. 40544 — The database has reached its size quota. Cause: the database hit the max data size for its tier — not transient; data-growing writes fail. Confirm: SELECT SUM(size)*8/1024 AS used_mb FROM sys.database_files vs the tier max; portal Overview shows used/max storage. Fix: free space (archive/delete), or raise maxSizeBytes/scale to a tier with more storage (az sql db update --max-size 500GB). Do not retry.

12. 49918/49919/49920 — too many operations in progress for subscription. Cause: control-plane throttling — automation firing too many scale/create operations across many databases at once (not a database problem). Confirm: the failing call is an az sql db update/create or ARM loop, not a query; the message names the subscription. Fix: serialize and back off the management operations (delays/retries in the loop) or batch them — a deployment-pipeline fix, not a SQL fix.

13. 233 / error during the pre-login handshake. Cause: TLS mismatch — the client negotiates TLS 1.0/1.1 but the server requires 1.2+, or Encrypt is off. Confirm: az sql server show -g rg -n srv --query minimalTlsVersion returns 1.2; the client driver is old. Fix: set Encrypt=True, upgrade the driver (ODBC/JDBC/Microsoft.Data.SqlClient), and ensure the OS/runtime enables TLS 1.2.

14. 40551 / 40553 — session killed for excessive tempdb / memory. Cause: a single query needs a huge sort/hash (memory grant) or spills heavily to tempdb — an oversized aggregation, a missing index forcing a sort, or an unbounded result. Confirm: RESOURCE_SEMAPHORE waits in sys.dm_exec_requests; the plan shows spill warnings / large memory grants; Query Store avg_tempdb_space_used high. Fix: add the index that removes the sort, paginate, reduce the batch size, or cap the grant (OPTION (MAX_GRANT_PERCENT = n)); scale only if the demand is legitimate.

Best practices

The alerts that turn these into self-detecting incidents — wire each into Azure Monitor:

Alert on Signal source Threshold Catches
log_write_percent AzureMetrics > 90% for 5 min Log-rate throttling (most-missed)
workers_percent AzureMetrics > 80% for 5 min Worker exhaustion → 10928
cpu_percent AzureMetrics > 85% for 10 min Compute saturation
storage_percent AzureMetrics > 85% Approaching 40544
Long blocking chain Scheduled DMV query / Log Analytics blocker > 30 s Blocking before customers notice
Deadlocks/min system_health / metric > 0 sustained Deadlock storms (1205)
Failed connections AzureMetrics connection_failed spike Firewall/policy/auth regressions
Plan regression Query Store / Insights plans > 1 on top query Latency jumps with no deploy

Security notes

The least-privilege role choices, so you grant the smallest thing that works:

Need Grant Not Why
App reads data db_datareader db_owner Read-only blast radius
App writes data db_datawriter (+reader) db_owner No DDL/permission rights
Run diagnostic DMVs VIEW DATABASE STATE db_owner / sysadmin See state without data control
Read Query Store / perf VIEW DATABASE PERFORMANCE STATE broad admin Perf data without full state
Manage schema in CI/CD db_ddladmin (scoped) server admin DDL without data ownership
Manage server (control plane) RBAC SQL Server Contributor (scoped) Owner on the subscription Narrow control-plane scope

The connection-string hardening checklist — what each keyword should be in production:

Keyword Secure value Why
Encrypt True (or Strict/Mandatory) Forces TLS; avoids 233 downgrade
TrustServerCertificate False Validate the server cert (no MITM)
Authentication Active Directory Managed Identity No password to leak
User ID / Password omitted (use MI/Entra) Secrets out of config
Connection Timeout 15–30 s Fail fast on network/auth issues
Application Name a real app name Identifies the SPID in dm_exec_sessions

Cost & sizing

The bill drivers and where troubleshooting intersects cost:

Rough monthly figures (East US / South-India region, list price, single database — always confirm in the pricing calculator) so you can frame the tier conversation:

Tier / objective Model What you get Approx INR / month Approx USD / month Fits
Basic DTU 5 DTU, 2 GB ~₹420 ~$5 Tiny dev/test
S0 DTU 10 DTU, 250 GB ~₹1,250 ~$15 Small app, light load
S3 DTU 100 DTU, 1 TB ~₹12,500 ~$150 Mid OLTP
GP serverless 1 vCore vCore auto-pause, per-sec ~₹0–1,000 (idle-dependent) ~$0–12 Spiky / non-prod
GP provisioned 2 vCore vCore 2 vCore, 1 TB ~₹21,000 ~$250 Steady prod baseline
GP provisioned 8 vCore vCore 8 vCore, 1 TB ~₹84,000 ~$1,000 Busy OLTP
BC 8 vCore vCore 8 vCore, local SSD, readable replica ~₹2,20,000 ~$2,600 Low-latency / HA-critical
Hyperscale 8 vCore vCore 8 vCore, up to 100 TB ~₹95,000+ (storage-driven) ~$1,150+ Very large DBs

What each lever buys, and when it pays for itself:

Cost lever Saves Best for Trade-off
Serverless auto-pause Idle compute cost Dev/test, spiky Resume latency after pause
Reserved capacity (1/3-yr) ~30–55% on steady compute Predictable baseline Commitment
Elastic pool Compute across many DBs Fleets with uneven peaks Noisy-neighbour 10929
Right-size from peaks Over-provisioning Everything Needs measurement discipline
Hyperscale storage model Pre-provisioned storage Large/growing DBs Different ops model
Fix the query, don’t scale A whole tier’s spend Throttle from one bad query Engineering time

Discipline: size from measured peaks, use serverless for spiky/non-prod, reserve the baseline, and treat a recurring throttle as “fix the query first, then scale if the demand is legitimate.”

Interview & exam questions

1. A connection works from Cloud Shell but times out from a peered VNet. Hypothesis and confirmation? The redirect vs proxy trap: in-VNet clients default to Redirect, sending traffic to the node on 11000–11999 after the 1433 handshake; if that range is blocked it fails. Confirm via az sql server conn-policy show and nc to port 11000 vs 1433; fix by forcing Proxy or opening 11000–11999 to the Sql tag.

2. What does 40615 mean and the gotcha about the IP? The client’s public egress IP isn’t allowed by any firewall rule — and for VNet traffic the IP shown is the NAT/proxy public IP, not the VM’s private IP, so allow the egress IP (or use a VNet rule/Private Link).

3. Distinguish 10928 and 10929. 10928 = your database hit a hard per-database limit (Resource ID 1 = workers, 2 = sessions); 10929 = the server/elastic pool is too busy to honour your minimum guarantee — shared-level contention versus your DB’s own ceiling.

4. Network problem vs slow query? Look at which timeout fired: a connection timeout (15s) means the connection failed (Hops 1–4 — DNS/firewall/policy/auth); a command timeout (30s) means the query ran too long (blocking/index/plan). Different layers.

5. CPU at 30% but writes crawl? Log-rate governance: the log rate is capped at the tier, so writes wait on LOG_RATE_GOVERNOR/WRITELOG while CPU idles. Confirm with avg_log_write_percent at 100% in sys.dm_db_resource_stats; fix by batching writes or scaling.

6. Find a blocking chain. Query sys.dm_exec_requests where blocking_session_id <> 0, joined to sys.dm_exec_sql_text; the root blocker has blocking_session_id = 0. sp_who2 (BlkBy) is the quick view, sys.dm_tran_locks shows lock modes. Resolve by killing/committing the root blocker, then fix the cause.

7. Blocking vs deadlock? Blocking is one session waiting on another’s lock — it persists until released. A deadlock is mutual blocking the engine resolves by killing a victim (error 1205), so it auto-resolves but a transaction fails. Break blocking manually; design out deadlocks with consistent lock order.

8. Which errors should an app retry? The transient set — 40613, 40197, 40143, 10928, 10929, 10053/10054/10060, 233, 1205 — with exponential backoff. Never the persistent ones — 40615/40914 (firewall), 18456 (auth), 40544 (size quota) — which fail identically.

9. Query Store for a query that regressed with no code change? It persists query text, multiple plans and runtime stats over time, so you spot plans > 1 and a slower new plan; force the previously good plan (sp_query_store_force_plan) as an immediate fix without changing code.

10. Private Link added, in-VNet connections fail. What broke? Almost always DNS — the name still resolves to the public gateway IP because the privatelink.database.windows.net zone isn’t linked to the VNet (or custom DNS doesn’t forward to 168.63.129.16). Confirm with nslookup and az network private-dns link vnet list; set policy to Proxy.

11. What are 49918/49919/49920 and where to fix them? Control-plane (subscription-level) throttles — too many scale/create operations in progress, typically from automation looping over databases. Fix in the deployment pipeline (serialize, back off the az/ARM calls), not in any database.

12. How do tempdb and memory show up in throttling? 40551 terminates a session for excessive tempdb (sorts/spills/temp tables), 40553 for excessive memory (oversized grants); waits show RESOURCE_SEMAPHORE. The fix is query-side — reduce sort/hash size, add indexes, paginate.

13. The 18456 message gives no detail. How do you narrow it? Read the error state: state 8 = wrong password, 38/40 = wrong or inaccessible database (often connected to master), 102–111 = Entra token problems. The state turns a generic “login failed” into a precise cause.

14. Why prefer a VNet rule or Private Link over a server firewall rule for in-VNet apps? VNet traffic egresses through a NAT/firewall public IP that rotates, so firewall rules chase a moving target and produce intermittent 40615. A VNet rule allows the subnet, and Private Link removes the public path entirely — both stable.

Cert mapping: DP-300 (Azure Database Administrator Associate)monitor and optimize operational resources / query performance, configure resources for optimal performance — and AZ-305 for network-isolation design (Private Link, firewall, connection policy) of data tiers.

Quick check

  1. From a peered VNet, your app connects but queries hang; from Cloud Shell it’s fine. What’s the most likely cause and the two valid fixes?
  2. You see 10928 Resource ID: 1. Which resource did you exhaust, and what’s a frequent indirect cause of it?
  3. Your writes are slow but CPU is at 25%. Which governor do you suspect, and which DMV column confirms it?
  4. Which of these should an app retry: 40615, 40613, 18456, 10929? Why?
  5. What is the difference between the connection timeout and the command timeout, and what does each one’s expiry tell you about where the problem is?

Answers

  1. Redirect connection policy + blocked 11000–11999. In-VNet clients default to Redirect, which moves traffic to the node on the high-port range after the 1433 handshake; if that range is blocked the connection fails (Cloud Shell uses Proxy and works). Fix by forcing Proxy (az sql server conn-policy update --connection-type Proxy) or opening 1433 + 11000–11999 outbound to the Sql service tag.
  2. Worker threads (Resource ID 1; Resource ID 2 would be sessions). A frequent indirect cause is blocking — a long blocker makes every queued request hold a worker longer, so worker count climbs to the cap. Fix the blocking chain and/or scale up.
  3. Log-rate governance. Confirm with avg_log_write_percent (near 100%) in sys.dm_db_resource_stats, alongside LOG_RATE_GOVERNOR/WRITELOG waits in sys.dm_exec_requests. Fix by batching writes or scaling.
  4. Retry only 40613 and 10929 (transient — reconfiguration / shared-capacity busy). Do not retry 40615 (firewall) or 18456 (auth) — they’re persistent and fail identically until you fix the config/credential.
  5. The connection timeout (default 15s) bounds establishing the connection (DNS+TCP+TLS+login); its expiry means the problem is in the network/auth hops (1–4). The command timeout (default 30s) bounds a single query’s runtime; its expiry means the query is slow (blocking, missing index, plan regression) — an engine/query problem, not a network one.

Glossary

Next steps

You can now triage any Azure SQL incident by hop and prove the failing layer. Build on it with the adjacent topics:

AzureAzure SQL DatabaseTroubleshootingConnectivityThrottlingBlockingDMVPrivate Link
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading