Azure Key Vault: Secrets, Keys and Certificates Done Right

Quick take: Secrets in config files and certificates on disk are liabilities you cannot audit and cannot rotate. Azure Key Vault moves them into a managed, access-controlled, logged service where every retrieval is an identity-checked, recorded event and rotation becomes a single operation instead of a fleet-wide change.

A development team I reviewed stored a production database password directly in their App Service application settings, in plaintext, copied into a .env file on three developer laptops and pasted into a runbook. When a contractor rolled off, nobody could answer the only two questions that matter: who has seen this password, and where else does it live? Rotating it meant hunting through a dozen apps and hoping they’d found them all. The fix was not a policy memo — it was Key Vault. The password moved into a vault, the apps authenticated with a managed identity instead of a stored credential, every read was logged to Azure Monitor, and the next rotation was one az keyvault secret set followed by a config refresh. The contractor’s access evaporated the moment their identity was removed. That is the entire value proposition, and this article is how you get there without the three or four mistakes that turn Key Vault from a safety net into a 3am outage.

Key Vault holds three kinds of object — secrets (arbitrary strings: connection strings, API keys, passwords), keys (cryptographic keys used for encryption, signing and wrapping, optionally HSM-backed), and certificates (X.509 certs with a managed lifecycle and auto-renewal) — behind two distinct authorization surfaces (a control plane that manages the vault itself and a data plane that reads the objects inside it), reachable either over the public endpoint or locked behind a Private Endpoint. Get the mental model of those layers right and Key Vault is boringly reliable. Get it wrong — a managed identity that was never enabled, a data-plane role you forgot to assign, a firewall that blocks your own app, a certificate nobody wired for rotation — and you get the failure modes this article enumerates exhaustively, each with the exact az command or portal blade that confirms it and the precise fix.

By the end you will treat secrets, keys and certificates as governed assets rather than files. You will know when to use RBAC over the legacy access-policy model, why soft-delete and purge protection are non-negotiable, how Key Vault references let App Service and Functions pull secrets with zero credentials in config, when a workload needs an HSM (and whether Standard, Premium, or Managed HSM is the right home), and how to make certificates renew themselves so a 2am TLS expiry never happens again. Because this is a reference you will return to mid-incident, the options, limits, error codes, roles and tiers are all laid out as scannable tables — read the prose once, then keep the tables open.

What problem this solves

Applications need secrets, keys and certificates to function, but the places teams instinctively put them are all liabilities. A connection string in appsettings.json is in source control and on every laptop that cloned the repo. An API key in an environment variable is visible to anyone with the portal or a shell on the box. A .pfx certificate on disk is a file that can be copied, has no rotation story, and silently expires. None of these can answer “who accessed this and when,” none can be rotated without touching every consumer, and all of them widen the blast radius of a single leak to your entire estate.

What breaks without Key Vault is not abstract. A leaked credential in a public Git history is among the most common breach vectors there is — and once it is in history, rotating is the only remedy, because the old value is permanent. Hard-coded secrets mean rotation is a coordinated, error-prone deployment instead of a config change, so teams simply don’t rotate, and a five-year-old database password is “fine until it isn’t.” Certificates that live on disk expire without warning and take production TLS down at the worst possible moment. And without a central audit trail, a security review cannot prove who touched what, which fails most compliance regimes outright.

Who hits this: essentially every team running anything on Azure. It bites hardest where secrets multiply — microservice estates with dozens of connection strings, apps with third-party API keys, anything terminating TLS on a custom domain, and any workload under a compliance regime (PCI-DSS, HIPAA, ISO 27001, SOC 2) that mandates key custody, rotation, and access logging. Key Vault is the Azure-native answer to all of it, and the cost of getting it slightly wrong is exactly the kind of failure that pages you. The whole field, framed before the deep dive:

Pain in production | What it looks like | Root liability | What Key Vault changes
Secret in config / source control | Password in appsettings.json, in Git history | Plaintext, copyable, permanent in history | Secret lives in the vault; config holds a reference, not the value
No idea who saw a credential | Contractor leaves, nobody can audit access | No access log | Every read is a logged, identity-attributed event
Rotation is a deployment | Changing a DB password touches 12 apps | Value duplicated everywhere | Rotate once in the vault; consumers re-read
Certificate expired at 2am | TLS down, frantic manual renewal | Cert on disk, no lifecycle | Managed cert with auto-renewal + expiry events
Encryption key on the app box | Key file alongside the data it protects | Key and data co-located | Key in the vault (or HSM); app calls wrap/unwrap
Compliance audit fails | Cannot prove key custody / rotation | No central control or trail | Centralized custody, RBAC, soft-delete, audit logs

Prerequisites & where this fits

You should be comfortable with the Azure Resource Manager model — subscriptions, resource groups, and that everything is a resource with an ID (the Azure Resource Hierarchy Explained covers this). You should understand Microsoft Entra ID (formerly Azure AD) at the level of “identities get tokens and tokens are checked against permissions,” and ideally have met managed identities before. Running az in Cloud Shell, reading JSON output, and basic TLS/certificate concepts (a cert has a private key, a chain, and an expiry) will all help. Nothing here requires cryptography expertise — Key Vault’s job is to make you not need it.

This sits at the heart of the Security & Identity track and is upstream of almost everything else. Apps pull secrets from it (Azure Functions and Serverless Patterns and App Service both use Key Vault references), gateways pull TLS certs from it (Application Gateway v2 WAF: End-to-End TLS, mTLS, and Custom Rule Tuning), container registries and storage use customer-managed keys housed in it, and App Configuration references it for secret-typed settings (Azure App Configuration in Production). When you lock it behind a Private Endpoint you are applying the same pattern as Azure Private Link and Private DNS: Keeping PaaS Off the Public Internet. A quick map of who owns what during an incident, so you call the right person:

Layer | What lives here | Who usually owns it | Failure classes it can cause
Caller identity | Managed identity, app registration | App / dev team | Empty KV reference, app crash-loop
Entra ID | Token issuance, RBAC assignments | Identity team | Token denied; no role assigned (403)
Vault control plane | SKU, firewall, soft-delete, RBAC mode | Platform / security | Misconfigured network, wrong auth model
Vault data plane | Secrets/keys/certs read/write | App + security | 403 on get; throttling (429)
Network path | Private Endpoint, DNS, firewall ACLs | Network team | ForbiddenByFirewall; DNS resolves public
Backing CA / HSM | Certificate issuer, HSM key custody | Security / PKI | Cert won’t issue; key not exportable

Core concepts

Five mental models make every later decision obvious.

A vault is a boundary, not a database. A Key Vault is a named, regional resource (https://<name>.vault.azure.net) that holds three object types and enforces who can do what to them. It is a security and governance boundary first — you separate vaults by environment and sensitivity, not by convenience. The vault name is globally unique because it becomes a public DNS name, even when you later restrict it to a Private Endpoint.

Control plane and data plane are different doors with different keys. The control plane (Azure Resource Manager) governs the vault as a resource: create/delete it, set its firewall, change its SKU, configure soft-delete, assign data-plane roles. You authorize it with Azure RBAC roles like Key Vault Contributor, scoped at subscription/RG/vault. The data plane governs the objects inside: get a secret, sign with a key, import a cert. You authorize it either with Azure RBAC data-action roles (e.g. Key Vault Secrets User) or with the legacy per-vault access-policy list — and you pick exactly one model per vault. The single most common Key Vault mistake is confusing these: a Key Vault Contributor can manage the vault but cannot read a secret unless they also hold a data-plane role. Management access is not data access.

Identity is the currency; managed identity is the way you pay. Every data-plane call must present a valid Entra ID token proving an identity, which the vault checks against its authorization model. For apps, the right identity is a managed identity — an Entra identity Azure manages for the resource, with no secret you store anywhere. The app asks the platform for a token, the platform returns one, the app calls the vault. This is the whole point: the credential to access your secrets is itself not a stored secret. No managed identity means no token means the call fails — which is exactly why a forgotten identity makes an app crash-loop with empty secret values.

Soft-delete and purge protection make deletion survivable. Soft-delete (mandatory and always on for new vaults) means a deleted vault or object enters a recoverable state for a retention period (7–90 days, default 90) instead of vanishing. Purge protection (optional but recommended, and irreversible once enabled) means that during the retention window, nobody — not even an owner, not even an attacker with full rights — can permanently purge the resource early. Together they defend against accidental delete and malicious “delete everything” attacks. The cost is that a soft-deleted vault name is reserved until it’s purged or recovered, which trips up redeployments.

Versions are immutable; rotation creates a new version. Every secret, key and certificate is versioned. Updating a secret doesn’t overwrite — it adds a new version and marks it current; old versions remain (until you disable/delete them). You can reference a specific version (pinned) or the current version (auto-following). This is what makes rotation safe: you create version 2, consumers that reference “current” pick it up, and version 1 is still there if you need to roll back. Reference the unversioned URI to follow rotation; reference the versioned URI to pin.
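
As a rough illustration of the pinned-versus-following distinction, the helper below looks at a secret URI and reports which kind it is. The function name secret_ref_kind and the sample version string are hypothetical; the sketch only inspects the URI shape described above and never calls Azure.

```shell
# Classify a Key Vault secret URI as "pinned" (versioned) or "following" (unversioned).
# Illustrative helper: it only inspects the URI shape and never contacts the vault.
secret_ref_kind() {
  local uri path
  uri="${1%/}"                      # drop one trailing slash, if present
  case "$uri" in
    https://*/secrets/?*) ;;        # must be an https secret URI that names a secret
    *) echo "invalid"; return 1 ;;
  esac
  path="${uri#https://*/secrets/}"  # leaves "Name" or "Name/version"
  case "$path" in
    */*) echo "pinned" ;;           # a second path segment is the version
    *)   echo "following" ;;        # no version: resolves to the current version
  esac
}
```

Calling secret_ref_kind on a URI that ends in /secrets/DbConnString/ prints following, while the same URI with a version segment appended prints pinned. A check like this could sit in a pipeline that lints app settings before deployment.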

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the model side by side:

Concept | One-line definition | Where it lives | Why it matters
Vault | A regional container for secrets/keys/certs | Subscription / resource group | The security boundary; one per env/sensitivity
Secret | An arbitrary string value (≤25 KB) | Inside a vault | Connection strings, passwords, API keys
Key | A cryptographic key (RSA/EC), optionally HSM | Inside a vault | Encrypt/decrypt, sign/verify, wrap/unwrap
Certificate | An X.509 cert with managed lifecycle | Inside a vault (key + secret pair) | TLS, mTLS, code signing; auto-renewal
Control plane | Managing the vault resource | Azure Resource Manager | Create/delete/firewall/SKU; RBAC-governed
Data plane | Reading/writing objects inside the vault | *.vault.azure.net | Get/set/sign; RBAC or access policy
Access policy | Legacy per-vault permission list | On the vault | One of two auth models (the old one)
Azure RBAC (data) | Role-based data-plane access | Entra + scope | The recommended auth model
Managed identity | Secret-free Entra identity for a resource | On the app/VM/etc. | How apps authenticate with no stored creds
KV reference | @Microsoft.KeyVault(...) in a setting | App setting / App Config | App pulls a secret with zero creds in config
Soft-delete | Recoverable deletion window | Vault property | 7–90 day grace; mandatory now
Purge protection | Block early permanent deletion | Vault property | Irreversible; defends against malicious purge
HSM | Hardware Security Module key custody | Premium / Managed HSM | Keys never leave certified hardware
Rotation policy | Automatic rotation or renewal schedule for a key or certificate | On the object | Hands-off rotation; expiry events

Secrets, keys and certificates — choosing the right object

The first decision on every asset is which object type it belongs in. Putting a TLS certificate in as a raw secret, or storing a password as a “key,” works just badly enough to cause pain later. Here is the definitive comparison:

Dimension | Secret | Key | Certificate
What it is | Arbitrary string/bytes | Cryptographic key (RSA, EC) | X.509 cert + its key
Typical use | Connection strings, passwords, API keys | Encrypt/decrypt, sign/verify, wrap/unwrap (CMK) | TLS/mTLS, code signing
You can read the value? | Yes: get returns the string | No: key material never leaves; you call operations | Public cert yes; private key only if exportable
Size / shape limit | 25 KB value | RSA 2048/3072/4096; EC P-256/384/521 | Bound by underlying key + secret limits
HSM-backed option | No | Yes (Premium / Managed HSM) | Via its key (Premium)
Versioned | Yes | Yes | Yes
Auto-rotation | No built-in policy; near-expiry events plus your own automation | Yes: key rotation policy | Yes: integrated CA auto-renew
Backing storage | Single object | Single object | Stored as a key + a secret under the hood
Cert exposed as 3 objects | n/a | n/a | Certificate, Key, and Secret (the PFX/PEM) entries

Five placement rules that prevent the most common modelling mistakes:

If you have… | Put it in as a… | Not a… | Because
A database connection string | Secret | Key | It’s a string you read back; keys don’t return material
An RSA key to encrypt blobs (CMK) | Key | Secret | You want sign/wrap operations, not the raw bytes
A TLS cert for a custom domain | Certificate | Secret (raw PFX) | The certificate object gives lifecycle + auto-renew
A symmetric password/passphrase | Secret | Key | Key Vault keys are asymmetric (RSA/EC); symmetric → secret or Managed HSM
An SSH private key | Secret | Key | It’s opaque bytes you retrieve, not a KV crypto key

Secrets in depth

A secret is a versioned name→value pair where the value is any string up to 25 KB, plus optional attributes: enabled (a disabled secret can’t be read), activation date (nbf), expiry date (exp), a content-type tag, and arbitrary metadata tags. Crucially, Key Vault does not enforce the dates on secrets: nbf and exp are advisory, so a get still returns a secret that has expired or is not yet active. Only enabled actually blocks reads; you enforce expiry yourself through rotation and monitoring. Set one and read it back:

# Create/update a secret (this becomes a new version, marked current)
az keyvault secret set --vault-name kv-shop-prod --name DbConnString \
  --value "Server=tcp:sql-shop.database.windows.net;Database=orders;..." \
  --expires "2026-12-31T00:00:00Z" --content-type "text/plain"

# Read the current version (the value comes back in plaintext to an authorized caller)
az keyvault secret show --vault-name kv-shop-prod --name DbConnString --query value -o tsv

In Bicep you generally create the vault declaratively and set secret values out-of-band (you don’t want plaintext secrets in templates), but you can declare a secret resource whose value comes from a secure parameter:

@secure()
param dbConnString string

resource kv 'Microsoft.KeyVault/vaults@2023-07-01' existing = { name: 'kv-shop-prod' }

resource secret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = {
  parent: kv
  name: 'DbConnString'
  properties: {
    value: dbConnString          // pass via secure pipeline variable, never literal
    contentType: 'text/plain'
    attributes: { enabled: true, exp: 1798675200 } // unix epoch
  }
}

The full secret attribute set and how to reason about each:

Attribute | What it does | Default | When to set it | Gotcha
enabled | Whether the secret can be read | true | Disable to revoke without deleting | A disabled current version → consumers fail
exp (expires) | Advisory expiry timestamp | none | Force a rotation deadline | KV still returns it; you must monitor/rotate
nbf (not-before) | Advisory activation timestamp | none | Stage a future value | Informational for secrets: reads before nbf still succeed
contentType | Free-text hint (e.g. mime) | none | Label PFX vs text vs JSON | Purely informational
Tags | Key/value metadata | none | Ownership, env, rotation owner | Tags are not secret; keep sensitive values out of them
recoveryLevel | Soft-delete/purge posture (read-only) | inherits vault | n/a (read-only) | Reflects vault soft-delete + purge settings
Value size | The string itself | n/a | n/a | Hard cap 25 KB; larger → use Blob + CMK
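
Because exp is advisory, something outside the vault has to notice an approaching deadline. The sketch below is one way to do that from a scheduled job; the function names are hypothetical, it assumes GNU date (the -d flag), and it relies on az keyvault secret list, which returns attributes but not values.

```shell
# Whole days (truncated toward zero) from a reference time until an ISO-8601 expiry.
# Negative once the expiry has passed by a day or more. Assumes GNU date.
days_until_expiry() {
  local expires_iso now_epoch exp_epoch
  expires_iso="$1"
  now_epoch="${2:-$(date -u +%s)}"   # optional override keeps the function testable
  exp_epoch=$(date -u -d "$expires_iso" +%s) || return 1
  echo $(( (exp_epoch - now_epoch) / 86400 ))
}

# Print secrets in a vault whose exp falls within N days (default 30).
report_expiring_secrets() {
  local vault threshold name expires days tab
  vault="$1"
  threshold="${2:-30}"
  tab=$(printf '\t')
  az keyvault secret list --vault-name "$vault" \
      --query "[?attributes.expires].[name, attributes.expires]" -o tsv |
  while IFS="$tab" read -r name expires; do
    days=$(days_until_expiry "$expires") || continue
    if [ "$days" -le "$threshold" ]; then
      echo "$name expires in $days day(s)"
    fi
  done
}
```

The same date call converts a deadline into the epoch form that the Bicep exp attribute expects: date -u -d '2026-12-31T00:00:00Z' +%s prints 1798675200, the value used in the template above.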

Keys in depth

A key is cryptographic material you never see. You don’t get the bytes; you ask the vault to perform an operation with it — encrypt/decrypt, wrap/unwrap (key-wrapping for envelope encryption), sign/verify. This is the model behind customer-managed keys (CMK) for Storage, SQL TDE, Disk Encryption and Container Registry: the service holds your data, your key stays in Key Vault, and the service calls wrap/unwrap. Keys come in RSA (2048/3072/4096) and EC (P-256/P-384/P-521, and the secp256k1 variant), each optionally HSM-backed (the -HSM key types) on Premium or Managed HSM.

# Create an RSA 3072 key, software-protected (Standard); add --protection hsm on Premium.
# encrypt/decrypt are listed alongside wrapKey/unwrapKey so the direct encrypt call below is permitted.
az keyvault key create --vault-name kv-shop-prod --name cmk-storage \
  --kty RSA --size 3072 --ops encrypt decrypt wrapKey unwrapKey

# Encrypt a small payload with the key; the private key itself never leaves the vault
az keyvault key encrypt --vault-name kv-shop-prod --name cmk-storage \
  --algorithm RSA-OAEP-256 --value "$(echo -n 'data-key' | base64)" --data-type base64

The key option matrix — type, size, protection, allowed operations:

Setting | Values | Default | When to change | Trade-off / limit
Key type (kty) | RSA, RSA-HSM, EC, EC-HSM, oct-HSM | RSA | EC for smaller/faster sigs; HSM for custody | oct (symmetric) only on Managed HSM
RSA size | 2048, 3072, 4096 | 2048 | 3072+ for stronger/longer-lived keys | Larger = slower ops
EC curve | P-256, P-384, P-521, P-256K | P-256 | P-384/521 for higher assurance | secp256k1 niche (blockchain)
Protection | software, HSM | software (Standard) | HSM for FIPS / compliance | HSM keys can’t be exported in cleartext
Operations (ops) | encrypt, decrypt, sign, verify, wrapKey, unwrapKey | all | Least privilege per key | Granting all when you need wrap only
Exportable | true/false (release policy) | false | Only with secure-key-release + attestation | Most keys must be non-exportable
Rotation policy | auto/manual | manual | Schedule key rotation | New version; CMK consumers must follow

The cryptographic operations a key supports, and what each is for:

Operation | What it does | Typical caller | Algorithm examples
encrypt / decrypt | Protect small payloads directly | App doing envelope encryption | RSA-OAEP-256
wrap / unwrap | Wrap a data-encryption key (CMK) | Storage / SQL TDE / Disk | RSA-OAEP-256, AES-KW (Managed HSM)
sign / verify | Produce/check a digital signature | Token/code/document signing | RS256, PS256, ES256
getKey (public part) | Read the public key only | Verifiers, JWKS publishers | Public material only; private never leaves
(import) | Bring an existing key in | Migration / BYOK | RSA/EC, optionally --byok HSM

Certificates in depth

A certificate is the richest object: it bundles an X.509 cert, its private key (stored as a Key Vault key), and the exportable form (stored as a Key Vault secret — the PFX/PEM). That is why a single certificate shows up as three addressable objects: a certificate, a key, and a secret with the same name. Key Vault manages the lifecycle: issuance from an integrated CA (DigiCert, GlobalSign) or a self-signed/internal CA policy, and automatic renewal before expiry. This is the feature that makes “certificate expired at 2am” a solved problem.

# Create a self-signed cert with a policy (real workloads point issuerName at an integrated CA)
az keyvault certificate create --vault-name kv-shop-prod --name tls-shop \
  --policy "$(az keyvault certificate get-default-policy)"

# Inspect the subject, the renewal actions, and the three backing objects' URIs
# (id = certificate, kid = key, sid = secret)
az keyvault certificate show --vault-name kv-shop-prod --name tls-shop \
  --query "{sub:policy.x509CertificateProperties.subject, renew:policy.lifetimeActions, id:id, kid:kid, sid:sid}" -o json

The certificate policy controls issuance and renewal — the settings that matter:

Policy setting | What it controls | Typical value | When to change | Gotcha
issuerName | Who signs the cert | Self, DigiCert, GlobalSign | Public TLS → integrated CA | Self certs aren’t publicly trusted
Subject / SANs | CN and Subject Alternative Names | CN=shop.example.com + SANs | Multi-domain certs | Missing SAN → browser errors
Key type/size | Backing key | RSA 2048/3072, EC P-256 | Stronger key or EC | Must match what your endpoint accepts
Validity (months) | Cert lifetime | 12 (public CAs cap ~13 months) | Shorter for higher rotation | CA may override to its max
exportable | Whether the PFX can be exported | true (software), false (HSM) | Non-exportable for HSM custody | Non-exportable → App Service can’t import PFX
Auto-renewal (renewBeforeExpiry/lifetime action) | Renew N days/% before expiry | 30 days / 80% lifetime | Always set for managed certs | Self-signed renews; integrated CA needs CA wired
Renewal type | AutoRenew vs EmailContacts | AutoRenew | Hands-off vs notify-only | EmailContacts only warns; doesn’t renew

How a certificate maps to its three backing objects (the source of much confusion):

Object exposed | URI form | Contains | Use it for
Certificate | /certificates/<name> | Public cert + policy + metadata | Lifecycle, thumbprint, renewal status
Key | /keys/<name> | The private key (operations only) | Sign/decrypt without exporting the key
Secret | /secrets/<name> | The full PFX/PEM (if exportable) | Importing into App Service / App Gateway
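
To make the three-object mapping concrete, this small sketch prints the URIs that back one certificate name. The function name is hypothetical, and it assumes the public-cloud suffix vault.azure.net; sovereign clouds use different suffixes.

```shell
# Print the three data-plane URIs that back a single Key Vault certificate.
# Assumes the public-cloud DNS suffix; the function name is illustrative.
cert_object_uris() {
  local vault name base
  vault="$1"
  name="$2"
  base="https://${vault}.vault.azure.net"
  printf '%s\n' \
    "${base}/certificates/${name}" \
    "${base}/keys/${name}" \
    "${base}/secrets/${name}"
}
```

Running cert_object_uris kv-shop-prod tls-shop prints one URI per object. The last one, under /secrets/, is the form a consumer that imports the PFX needs, which is why such a consumer needs permission on secrets and not only on certificates.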

The two authorization models — RBAC vs access policies

This is where most teams either get it right and never think about it again, or get it wrong and fight 403s for a week. Every vault uses exactly one data-plane authorization model: modern Azure RBAC or legacy access policies. You set it at vault creation with enableRbacAuthorization and changing it later is disruptive.

Access policies (the original model) are a per-vault list: “this principal may do these operations on secrets, these on keys, these on certs.” They are flat (no inheritance), capped at 1024 entries per vault, not visible to Azure RBAC tooling, and grant operation permissions (get/list/set/delete) per object type. Azure RBAC instead uses standard role assignments — built-in roles like Key Vault Secrets User assigned at management-group/subscription/RG/vault/object scope — giving you inheritance, central governance through az role assignment, PIM/just-in-time eligibility, and a single consistent model across Azure. For anything new, use RBAC.

Dimension | Azure RBAC (recommended) | Access policies (legacy)
Granularity | Built-in/custom roles, down to individual object scope | Per-object-type operation flags
Inheritance | Yes: MG → sub → RG → vault → object | No: flat list on the vault
Scale limit | Azure RBAC limits (very high) | 1024 access policy entries / vault
Central management | az role assignment, Policy, PIM | Per-vault, bespoke
Just-in-time (PIM) | Yes (eligible assignments) | No
Separation of duties | Control vs data roles are distinct | Mixed in one place
Visibility | Standard “Access control (IAM)” | Separate “Access policies” blade
Default for new vaults | Increasingly the recommended default | Still the portal default in places

The data-plane RBAC roles you actually use — assign the narrowest that fits:

Role | Grants (data plane) | Give it to | Don’t give it to
Key Vault Secrets User | Read secret values | App managed identities | Humans who only need to manage the vault
Key Vault Secrets Officer | Full secret CRUD | Secret administrators / pipelines | Read-only apps
Key Vault Crypto User | Use keys (encrypt/sign/wrap) | Services doing crypto ops (CMK) | Apps that only read secrets
Key Vault Crypto Officer | Full key CRUD | Key administrators | App identities
Key Vault Certificates Officer | Full certificate CRUD | Cert administrators / automation | Read-only consumers
Key Vault Reader | Read metadata (not values) | Auditors, dashboards | Anyone needing values
Key Vault Crypto Service Encryption User | Wrap/unwrap for service CMK | Storage/SQL/etc. service principal | Interactive users

Control-plane roles — note they grant nothing on the data inside:

Control-plane role | Grants | Critical caveat
Key Vault Contributor | Manage the vault (firewall, SKU, policies) | Cannot read secrets; needs a data role too
Owner / Contributor (subscription) | Everything at control plane | Same caveat: not automatically a data reader
Reader | View the vault resource | No data-plane access at all

Assign a data-plane role to an app’s managed identity — the canonical pattern:

# Get the app's managed identity principal, then grant Secrets User at the vault scope
PRINCIPAL=$(az webapp identity show -n app-shop-prod -g rg-shop-prod --query principalId -o tsv)
VAULT_ID=$(az keyvault show -n kv-shop-prod -g rg-shop-prod --query id -o tsv)
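az role assignment create --assignee "$PRINCIPAL" \
  --role "Key Vault Secrets User" --scope "$VAULT_ID"

// Vault in RBAC mode + a Secrets User assignment for an app's identity
param location string = resourceGroup().location
param appPrincipalId string   // principalId of the app's managed identity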
resource kv 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-shop-prod'
  location: location
  properties: {
    sku: { family: 'A', name: 'standard' }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true        // RBAC model, not access policies
    enableSoftDelete: true
    softDeleteRetentionInDays: 90
    enablePurgeProtection: true
    publicNetworkAccess: 'Disabled'      // pair with a Private Endpoint
  }
}

resource secretsUser 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(kv.id, appPrincipalId, 'Key Vault Secrets User')
  scope: kv
  properties: {
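    // 4633458b-... is the role definition ID for Key Vault Secrets User; confirm the full GUID with
    // az role definition list --name "Key Vault Secrets User" --query "[0].name" -o tsv
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-...')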
    principalId: appPrincipalId
    principalType: 'ServicePrincipal'
  }
}

The legacy access-policy equivalent, for the vaults you inherit that still use it:

# Only works on a vault with enableRbacAuthorization = false
az keyvault set-policy --name kv-legacy --object-id "$PRINCIPAL" \
  --secret-permissions get list

When to pick which model — the decision table:

If… | Use | Why
New vault, modern estate | Azure RBAC | Central governance, inheritance, PIM
You need just-in-time elevation | Azure RBAC | Access policies have no PIM
You need >1024 distinct grantees | Azure RBAC | Access policies cap at 1024
You’re maintaining a vault already on access policies | Keep access policies (or plan a migration window) | Switching models is disruptive mid-flight
You want per-secret (object-level) scope | Azure RBAC | Assign roles at the individual object scope
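
Before applying the decision table to an inherited vault, it helps to know which model the vault is on today. The sketch below reads the enableRbacAuthorization property mentioned earlier; the function name is hypothetical, and treating a missing value as access policies is an assumption made for older vaults.

```shell
# Report which data-plane authorization model a vault uses.
# A missing or false enableRbacAuthorization is treated as access policies (assumption).
vault_auth_model() {
  local vault flag
  vault="$1"
  flag=$(az keyvault show --name "$vault" \
           --query "properties.enableRbacAuthorization" -o tsv) || return 1
  # Normalise case in case the CLI prints True rather than true
  flag=$(printf '%s' "$flag" | tr '[:upper:]' '[:lower:]')
  if [ "$flag" = "true" ]; then
    echo "rbac"
  else
    echo "access-policy"
  fi
}
```

A result of access-policy means az role assignment create will not grant data access on that vault, and az keyvault set-policy is the command that applies.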

Managed identity and Key Vault references — secrets with zero credentials

The payoff of all this is that an app reads its secrets without storing any credential at all. Two pieces make it work: a managed identity on the app (so it can get an Entra token), and a Key Vault reference in a setting (so the value is pulled from the vault at runtime rather than stored in config).

A Key Vault reference is a special app-setting (App Service/Functions) or App Configuration value of the form @Microsoft.KeyVault(SecretUri=https://kv-shop-prod.vault.azure.net/secrets/DbConnString/). At startup (and on a refresh interval) the platform resolves it using the app’s managed identity and injects the resolved value as the environment variable your code reads. Your code sees a normal connection string; the value never sits in config.

# 1) Give the app a system-assigned managed identity
az webapp identity assign -n app-shop-prod -g rg-shop-prod

# 2) Grant that identity read access to secrets (RBAC)
PRINCIPAL=$(az webapp identity show -n app-shop-prod -g rg-shop-prod --query principalId -o tsv)
az role assignment create --assignee "$PRINCIPAL" --role "Key Vault Secrets User" \
  --scope "$(az keyvault show -n kv-shop-prod -g rg-shop-prod --query id -o tsv)"

# 3) Point an app setting at the secret via a KV reference
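az webapp config appsettings set -n app-shop-prod -g rg-shop-prod --settings \
  "DbConnString=@Microsoft.KeyVault(SecretUri=https://kv-shop-prod.vault.azure.net/secrets/DbConnString/)"

// Same wiring in Bicep; assumes an App Service plan resource named plan is declared elsewhere
param location string = resourceGroup().location

resource site 'Microsoft.Web/sites@2023-12-01' = {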
  name: 'app-shop-prod'
  location: location
  identity: { type: 'SystemAssigned' }   // the identity that resolves the reference
  properties: {
    serverFarmId: plan.id
    siteConfig: {
      appSettings: [
        {
          name: 'DbConnString'
          // unversioned URI → follows rotation automatically
          value: '@Microsoft.KeyVault(SecretUri=https://kv-shop-prod${environment().suffixes.keyvaultDns}/secrets/DbConnString/)'
        }
      ]
    }
  }
}

The two reference URI styles and their behaviour:

Reference form | Example tail | Behaviour | Use when
Unversioned | /secrets/DbConnString/ | Follows the current version (picks up rotation) | You want rotation to flow without redeploy
Versioned | /secrets/DbConnString/<ver> | Pinned to that exact version | You need a deterministic, audited value
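
Hand-typing the reference string is an easy place to drop a slash or a parenthesis, so a tiny builder can help. This is a sketch with a hypothetical function name; it assumes the public-cloud vault.azure.net suffix and produces the SecretUri form shown above.

```shell
# Build a Key Vault reference for an app setting.
# With no version argument the URI ends in "/" and follows the current version;
# with a version it is pinned to that version.
kv_reference() {
  local vault secret version
  vault="$1"
  secret="$2"
  version="${3:-}"
  printf '@Microsoft.KeyVault(SecretUri=https://%s.vault.azure.net/secrets/%s/%s)\n' \
    "$vault" "$secret" "$version"
}
```

It could then feed the existing command, for example az webapp config appsettings set -n app-shop-prod -g rg-shop-prod --settings "DbConnString=$(kv_reference kv-shop-prod DbConnString)".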

Why a Key Vault reference fails — the exact prerequisites, each a failure mode if missing:

Prerequisite | If missing… | Confirm | Fix
Managed identity enabled | Reference resolves to empty; app crash-loops | az webapp identity show | az webapp identity assign
Data-plane role assigned | 403 on resolve; value empty | az role assignment list --assignee <principal> --scope <vaultId> | Assign Key Vault Secrets User
Vault firewall allows the app | ForbiddenByFirewall; empty value | az keyvault show --query properties.networkAcls | Allow trusted services / Private Endpoint
Secret exists & enabled | Reference resolves to nothing | az keyvault secret show ... | Create/enable the secret; fix the URI
Correct SecretUri | Silent failure / wrong value | Compare URI to az keyvault secret show --query id | Fix host/object/version in the URI
App Configuration reference (if via App Config) | Setting unresolved | App Config “Key Vault reference” status | Grant App Config’s identity Secrets User too
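
The first two rows of that table can be checked in order from a shell. The sketch below is one possible shape, not a complete diagnostic: the function name and messages are hypothetical, it assumes the vault is in RBAC mode and in the same resource group as the app, and it only looks for Key Vault Secrets User (other roles such as Key Vault Secrets Officer also grant reads).

```shell
# Check the identity and role prerequisites for a Key Vault reference, stopping at the first gap.
# Assumes RBAC mode; --include-inherited also counts assignments made at RG or subscription scope.
check_kv_reference_prereqs() {
  local app rg vault principal vault_id roles
  app="$1"
  rg="$2"
  vault="$3"
  principal=$(az webapp identity show -n "$app" -g "$rg" --query principalId -o tsv)
  if [ -z "$principal" ]; then
    echo "FAIL: no managed identity on $app"
    return 1
  fi
  vault_id=$(az keyvault show -n "$vault" -g "$rg" --query id -o tsv)
  roles=$(az role assignment list --assignee "$principal" --scope "$vault_id" \
            --include-inherited --query "[].roleDefinitionName" -o tsv)
  if ! printf '%s\n' "$roles" | grep -qx "Key Vault Secrets User"; then
    echo "FAIL: identity lacks Key Vault Secrets User on $vault"
    return 1
  fi
  echo "OK: identity and role are in place"
}
```

A passing result narrows the problem to the remaining rows: the firewall, the secret itself, or the SecretUri.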

A subtle one: resolved values are cached, so after a rotation an unversioned reference keeps serving the old value until the platform’s periodic refresh fires (App Service documents this as within 24 hours) or the app restarts. The status of every reference is visible in the portal Environment variables blade (each shows resolved/error), which is the first place to look when a secret-backed setting “isn’t taking.”

Soft-delete, purge protection and recovery

These two features turn deletion from a catastrophe into an inconvenience — and one of them is irreversible, so understand it before you flip it.

Soft-delete is now always on for vaults (you cannot disable it on new vaults). When you delete a vault or an object, it is retained in a deleted but recoverable state for the retention period (configurable 7–90 days; default 90). During that window you can recover it. After the window — or if someone purges it deliberately — it’s gone. Purge protection closes the deliberate-purge hole: when enabled, no one can purge the vault or its objects before the retention period elapses, not even with full permissions. This is the control that defeats “attacker with Owner deletes and purges everything.” The catch: purge protection is irreversible — once on, you cannot turn it off for the life of the vault, and the retention period becomes a hard floor.

# Inspect the deletion posture
az keyvault show -n kv-shop-prod -g rg-shop-prod \
  --query "{softDelete:properties.enableSoftDelete, retention:properties.softDeleteRetentionInDays, purge:properties.enablePurgeProtection}" -o json

# Recover a soft-deleted secret (within the retention window)
az keyvault secret recover --vault-name kv-shop-prod --name DbConnString

# List and recover a soft-deleted *vault*
az keyvault list-deleted --query "[].{name:name, scheduledPurge:properties.scheduledPurgeDate}" -o table
az keyvault recover --name kv-shop-prod

The two settings, their effects, and the trade-offs:

| Setting | Values | Default (new vaults) | Effect | Irreversible? |
| --- | --- | --- | --- | --- |
| Soft-delete | on (forced) | on | Deleted objects recoverable for the retention window | n/a (always on) |
| Retention period | 7–90 days | 90 | How long recovery is possible | Can’t shorten below current with purge protection on |
| Purge protection | on / off | off (recommended: on) | Blocks early permanent purge by anyone | Yes; cannot be disabled once on |

Recovery scenarios and the exact operation:

| You deleted… | State | Recover with | Caveat |
| --- | --- | --- | --- |
| A secret/key/cert | Soft-deleted | az keyvault secret recover (etc.) | Within retention; needs recover permission |
| The whole vault | Soft-deleted | az keyvault recover -n <name> | Name reserved until recovered/purged |
| And purged it (no purge protection) | Gone | Unrecoverable | This is what PP prevents |
| And purge protection was on | Cannot purge early | Wait out retention or recover | The “delete everything” attack fails here |

The redeployment gotcha worth its own table — soft-delete reserves the name:

| Symptom | Cause | Confirm | Fix |
| --- | --- | --- | --- |
| VaultAlreadyExists on create, but you don’t see it | A same-named vault is soft-deleted, holding the name | az keyvault list-deleted | Recover it, or purge it (if PP off and retention permits), or pick a new name |
| Bicep/Terraform deploy fails recreating a vault | Prior delete left a soft-deleted vault | Same as above | Use az keyvault recover then let IaC adopt it |

Network isolation — the firewall and Private Endpoint

By default a vault is reachable on its public endpoint (still requiring auth). For sensitive data that’s not enough — you want the vault unreachable from the internet at all. Two layers do this: the vault firewall (IP/VNet allow-lists with a default-deny) and, the strong form, a Private Endpoint that gives the vault a private IP inside your VNet and removes the public path entirely (the same model as Azure Private Endpoint vs Service Endpoint: Secure PaaS Access).

The trap is locking the vault down and then blocking your own callers — including App Service Key Vault references and Azure services that need access. Two escape hatches matter: Allow trusted Microsoft services (lets certain first-party services through the firewall) and correct Private DNS so the vault’s hostname resolves to the private IP for your callers.

# Default-deny, then allow a specific VNet subnet and trusted services
az keyvault update -n kv-shop-prod -g rg-shop-prod \
  --default-action Deny --bypass AzureServices

az keyvault network-rule add -n kv-shop-prod -g rg-shop-prod \
  --vnet-name vnet-shop --subnet snet-app

# The strong form: disable public access and add a Private Endpoint
az keyvault update -n kv-shop-prod -g rg-shop-prod --public-network-access Disabled
az network private-endpoint create -n pe-kv-shop -g rg-shop-prod \
  --vnet-name vnet-shop --subnet snet-pe \
  --private-connection-resource-id "$(az keyvault show -n kv-shop-prod -g rg-shop-prod --query id -o tsv)" \
  --group-id vault --connection-name kv-conn

The same posture in Bicep (location and appSubnetId are assumed to be declared elsewhere in the template):

resource kv 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-shop-prod'
  location: location
  properties: {
    sku: { family: 'A', name: 'standard' }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true
    publicNetworkAccess: 'Disabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'           // let trusted first-party services through
      virtualNetworkRules: [ { id: appSubnetId } ]
      ipRules: []
    }
  }
}

The network controls, what each does, and the failure it causes when wrong:

| Control | Setting | Effect | Failure if misconfigured |
| --- | --- | --- | --- |
| Default action | Deny / Allow | Default-deny is the secure posture | Deny with no rules → you lock yourself out |
| IP rules | CIDR allow-list | Permit specific public IPs | Office IP changes → 403 ForbiddenByFirewall |
| VNet rules (service endpoint) | Subnet allow-list | Permit a subnet | Wrong subnet → caller blocked |
| Bypass | AzureServices / None | Let trusted services through | None → App Service KV refs may break |
| Public network access | Enabled / Disabled | Remove the public path entirely | Disabled without PE/DNS → nothing can reach it |
| Private Endpoint | + Private DNS zone | Private IP, internet path gone | Missing DNS → hostname resolves public → blocked |
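
The allow-list logic behind the first four rows can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the service's actual implementation: `firewall_allows` is a hypothetical helper, the `acls` dictionary is assumed to mirror the shape of `properties.networkAcls` as returned by `az keyvault show`, and the sketch deliberately ignores `publicNetworkAccess` and Private Endpoints. All IPs and resource IDs are made up.

```python
import ipaddress


def firewall_allows(acls, caller_ip=None, caller_subnet_id=None, trusted_service=False):
    """Simplified check: would the vault firewall let this caller through?"""
    if acls.get("defaultAction", "Allow") == "Allow":
        return True                       # no default-deny, everything passes
    if caller_ip is not None:
        ip = ipaddress.ip_address(caller_ip)
        for rule in acls.get("ipRules", []):
            if ip in ipaddress.ip_network(rule["value"], strict=False):
                return True               # caller IP is inside an allowed CIDR
    if caller_subnet_id is not None:
        wanted = caller_subnet_id.lower()
        for rule in acls.get("virtualNetworkRules", []):
            if rule["id"].lower() == wanted:
                return True               # caller subnet is allow-listed
    # Trusted first-party services pass only when bypass is AzureServices.
    return bool(trusted_service and acls.get("bypass") == "AzureServices")


acls = {
    "defaultAction": "Deny",
    "bypass": "AzureServices",
    "ipRules": [{"value": "203.0.113.0/24"}],          # illustrative office range
    "virtualNetworkRules": [{"id": "/subscriptions/000/vnet-shop/subnets/snet-app"}],
}
print(firewall_allows(acls, caller_ip="203.0.113.10"))  # True
print(firewall_allows(acls, caller_ip="198.51.100.1"))  # False
```

A `False` here corresponds to the 403 ForbiddenByFirewall case: the caller authenticated correctly but never reached the authorization check.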

Decision table — how locked-down should this vault be?

| Workload | Recommended network posture | Why |
| --- | --- | --- |
| Dev/sandbox vault | Public + firewall (your IPs) or trusted services | Convenience; low sensitivity |
| Standard production app | Private Endpoint + public disabled | Secrets off the internet entirely |
| Regulated (PCI/HIPAA) | Private Endpoint + PP + RBAC + Managed HSM keys | Compliance mandates isolation + custody |
| Vault used by many Azure PaaS | Firewall + bypass AzureServices | First-party services need a path |

HSM, Premium and Managed HSM — when hardware custody matters

For most secrets, the Standard SKU (software-protected keys) is correct and cheaper. You step up to hardware-backed key custody when compliance or risk demands that key material never exist in software. Three homes exist: Standard (software keys), Premium (a vault SKU adding HSM-protected keys on shared, FIPS 140-2 Level 2 validated HSMs), and Managed HSM (a dedicated, single-tenant pool of FIPS 140-2 Level 3 HSMs with its own RBAC and higher throughput). The decision is about assurance level and isolation, not features you can’t otherwise get.

| Dimension | Standard | Premium | Managed HSM |
| --- | --- | --- | --- |
| Key protection | Software | HSM (shared) | HSM (dedicated, single-tenant) |
| FIPS 140-2 level | n/a (software) | Level 2 | Level 3 |
| Tenancy | Multi-tenant | Multi-tenant | Single-tenant pool |
| Secrets & certs | Yes | Yes | Keys-focused (no secrets/certs object types) |
| Throughput | Standard vault limits | Standard vault limits | Much higher, dedicated |
| Cost model | Per-operation, low | Per-operation + HSM key surcharge | Fixed hourly per HSM pool (significant) |
| RBAC | Azure RBAC / access policy | Azure RBAC / access policy | Local HSM RBAC + Azure RBAC |
| Use when | Most secrets/keys | HSM keys, modest scale | Strict compliance, high crypto throughput, BYOK |

When to choose each — the decision table:

| Requirement | Choose | Why |
| --- | --- | --- |
| Connection strings, API keys, TLS certs | Standard | Software protection is fine and cheap |
| CMK with FIPS 140-2 Level 2 | Premium | HSM-backed keys without dedicated-pool cost |
| FIPS 140-2 Level 3 / single-tenant custody | Managed HSM | Dedicated HSMs, strongest assurance |
| Very high crypto ops/sec | Managed HSM | Dedicated throughput, not shared limits |
| BYOK / strict key-ceremony import | Managed HSM (or Premium) | Secure key import / HSM-to-HSM |
| Tight budget, no compliance mandate | Standard | Avoid the HSM surcharge entirely |

A common misread: you do not need Premium just to store secrets securely — Standard already encrypts everything at rest. Premium/Managed HSM is specifically about where the key material lives and what it’s certified to. Pay for it when an auditor asks “is this key in a FIPS-validated HSM,” not before.
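
The decision table reduces to a short precedence check, sketched here in Python. `choose_key_home` and its three flags are hypothetical names chosen for this illustration; real selection also weighs cost and the fact that Managed HSM holds keys only, so secrets and certificates still need a vault.

```python
def choose_key_home(needs_hsm_backed_keys: bool = False,
                    needs_single_tenant: bool = False,
                    needs_high_crypto_throughput: bool = False) -> str:
    """Pick the cheapest home that satisfies the stated requirements."""
    # Dedicated custody or dedicated throughput both point at Managed HSM.
    if needs_single_tenant or needs_high_crypto_throughput:
        return "Managed HSM"
    # HSM-backed keys without a dedicated pool: the Premium vault SKU.
    if needs_hsm_backed_keys:
        return "Premium"
    # Everything else: software protection is fine and cheapest.
    return "Standard"


print(choose_key_home())                             # Standard
print(choose_key_home(needs_hsm_backed_keys=True))   # Premium
print(choose_key_home(needs_single_tenant=True))     # Managed HSM
```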

Rotation — secrets, keys and certificate auto-renewal

Rotation is the feature that justifies the whole exercise, and the one most teams under-implement. There are three flavours, increasing in automation: manual (you set a new version on a schedule), event-driven secret rotation (an expiry date on the secret plus an Event Grid subscription and a Function that updates the backing service too), and certificate auto-renewal (the vault renews the cert itself before expiry).

Certificates are the easy win: set a lifetime action and the vault renews automatically. Self-signed certs renew outright; integrated-CA certs renew through the wired CA. Secrets are harder because rotating a database password means also changing it in the database: Key Vault can store a new version, but something must update the backing system. Secrets have no built-in rotation policy (keys do), so the standard pattern is driven by expiry: set an expiry date on the secret, Key Vault raises a SecretNearExpiry event through Event Grid ahead of that date, and the event triggers a Function that rotates the credential in the backing service and writes the new value back as a new secret version. Consumers using the unversioned reference pick it up on their next refresh.

# Certificate auto-renewal: renew when 30 days remain (lifetime action on the policy)
az keyvault certificate create --vault-name kv-shop-prod --name tls-shop --policy '{
  "issuerParameters": {"name": "Self"},
  "x509CertificateProperties": {"subject": "CN=shop.example.com", "validityInMonths": 12},
  "lifetimeActions": [{"trigger": {"daysBeforeExpiry": 30}, "action": {"actionType": "AutoRenew"}}],
  "keyProperties": {"exportable": true, "keyType": "RSA", "keySize": 3072, "reuseKey": false}
}'

# Secret expiry: set a 90-day expiry so Key Vault emits SecretNearExpiry (30 days before expiry) for the rotation Function
az keyvault secret set-attributes --vault-name kv-shop-prod --name DbPassword \
  --expires "$(date -u -d '+90 days' +%Y-%m-%dT%H:%M:%SZ)"

Wire the expiry event to automation via Event Grid:

# Subscribe a Function to Key Vault near-expiry / rotation events
az eventgrid event-subscription create --name kv-rotation \
  --source-resource-id "$(az keyvault show -n kv-shop-prod -g rg-shop-prod --query id -o tsv)" \
  --endpoint-type azurefunction \
  --endpoint "$(az functionapp function show -g rg-shop-prod -n fn-rotate --function-name Rotate --query id -o tsv)" \
  --included-event-types Microsoft.KeyVault.SecretNearExpiry Microsoft.KeyVault.CertificateNearExpiry

The rotation approaches compared:

| Approach | Automation | Updates backing service? | Effort | Best for |
| --- | --- | --- | --- | --- |
| Manual secret set | None | No (you do it) | Low setup, high ongoing toil | Rarely-rotated, low-risk secrets |
| Secret expiry + Event Grid + Function | High | Yes (your function) | Medium (build the function) | DB passwords, signing keys, API keys |
| Certificate auto-renewal (self-signed) | Full | n/a (cert renews itself) | Trivial (policy lifetime action) | Internal/self-signed TLS |
| Certificate auto-renewal (integrated CA) | Full | n/a | Medium (wire the CA issuer) | Public TLS on custom domains |

The Key Vault events you can subscribe to (the rotation triggers):

| Event type | Fires when | Typical handler action |
| --- | --- | --- |
| Microsoft.KeyVault.SecretNearExpiry | A secret approaches expiry (exp) | Rotate the credential, write new version |
| Microsoft.KeyVault.SecretExpired | A secret has expired | Alert / emergency rotate |
| Microsoft.KeyVault.CertificateNearExpiry | A cert approaches expiry | Renew (or verify auto-renew fired) |
| Microsoft.KeyVault.CertificateNewVersionCreated | A new cert version exists | Re-import to App Service / App Gateway |
| Microsoft.KeyVault.KeyNearExpiry | A key approaches expiry | Rotate CMK; notify consumers |
| Microsoft.KeyVault.SecretNewVersionCreated | A new secret version exists | Refresh caches / restart consumers |
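
Inside the Function, that table becomes a dispatch on the event type. The Python below is a minimal sketch of the routing only; it performs no rotation. `plan_action` is a hypothetical helper, and the payload field names (`eventType`, and `VaultName`, `ObjectName`, `Version` under `data`) are assumed to follow the Event Grid schema for Key Vault events, so verify them against the current schema before relying on them.

```python
# Event type -> handler action, taken from the table above.
ACTIONS = {
    "Microsoft.KeyVault.SecretNearExpiry": "rotate credential and write new version",
    "Microsoft.KeyVault.SecretExpired": "alert and emergency rotate",
    "Microsoft.KeyVault.CertificateNearExpiry": "renew or verify auto-renew fired",
    "Microsoft.KeyVault.CertificateNewVersionCreated": "re-import certificate to consumers",
    "Microsoft.KeyVault.KeyNearExpiry": "rotate key and notify consumers",
    "Microsoft.KeyVault.SecretNewVersionCreated": "refresh caches or restart consumers",
}


def plan_action(event: dict) -> dict:
    """Map one Event Grid event to the action the handler should take."""
    data = event.get("data") or {}
    return {
        "action": ACTIONS.get(event.get("eventType"), "ignore"),
        "vault": data.get("VaultName"),      # assumed field name
        "object": data.get("ObjectName"),    # assumed field name
        "version": data.get("Version"),      # assumed field name
    }


sample = {  # illustrative payload, trimmed to the fields used here
    "eventType": "Microsoft.KeyVault.SecretNearExpiry",
    "data": {"VaultName": "kv-shop-prod", "ObjectName": "DbPassword", "Version": "abc123"},
}
print(plan_action(sample)["action"])
```

Unknown event types fall through to "ignore", which keeps the handler safe if the subscription's filter is later widened.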

A reality check on consumers: a new version existing doesn’t mean every consumer is using it. App Service Key Vault references refresh on an interval or restart; App Gateway/Front Door need the cert re-imported (or, with managed-identity integration, re-synced). The CertificateNewVersionCreated event is your hook to push the new cert where it’s needed.

The throttling, limits and 403 reference

Key Vault is a shared, throttled service, and almost every production surprise is one of three things: a 403 (you’re not allowed, or the firewall blocked you), a 429 (you exceeded the transaction limit), or a missing object. Scan this first when something fails.

The error/status-code reference — the lookup table you keep open:

| Code | Meaning | Likely cause | How to confirm | Fix |
| --- | --- | --- | --- | --- |
| 401 Unauthorized | No/invalid token | Identity not sending a valid Entra token | Caller has no managed identity / wrong audience | Enable identity; request https://vault.azure.net audience |
| 403 Forbidden (AccessDenied) | Authenticated but not authorized | No data-plane role / access policy | az role assignment list --scope <vaultId>; access-policy blade | Assign Key Vault Secrets User (or policy) |
| 403 ForbiddenByFirewall | Network ACL blocked the caller | Firewall default-deny, caller not allow-listed | az keyvault show --query properties.networkAcls | Allow IP/subnet; bypass AzureServices; Private Endpoint |
| 403 ForbiddenByRbac | RBAC model, no role at this scope | Role missing or wrong scope | IAM blade on the vault/object | Assign role at the right scope |
| 404 SecretNotFound | Object/version doesn’t exist | Wrong name, deleted, or wrong vault | az keyvault secret show; list-deleted | Fix name/URI; recover if soft-deleted |
| 409 Conflict | Object in a conflicting state | Soft-deleted name reused; concurrent op | az keyvault list-deleted | Recover/purge; serialize operations |
| 429 Too Many Requests | Transaction limit exceeded | Burst beyond the per-vault cap | ServiceApiResult metric; Retry-After header | Cache in-process; exponential backoff; split vaults |
| 500/503 Service error | Transient backend issue | Rare platform blip | Retry with backoff; Service Health | Retry; if persistent, support |
| Disabled secret read | Returns failure | enabled=false on the version | az keyvault secret show --query attributes.enabled | Enable it or roll to a good version |
| Expired (exp) advisory | Value still returned | exp is advisory, not enforced on read | Check attributes.exp | Rotate; monitor expiry proactively |

The transaction limits that drive throttling, with real numbers (limits apply per vault, per region, with a higher subscription-wide ceiling; they are subject to change, so always verify current docs):

| Operation class | Approx. limit | Scope | Notes |
| --- | --- | --- | --- |
| Secret GET (and other “fast” transactions) | ~4,000 / 10 s | Per vault, per region | The cap you hit by not caching |
| HSM-key operations (RSA 2048+) | lower (hundreds–low-thousands / 10 s) | Per vault | HSM crypto is slower; budget accordingly |
| Certificate operations | lower than secret GETs | Per vault | Issuance/renewal are heavier |
| Managed HSM crypto ops | much higher than vault | Per HSM pool | Dedicated throughput is the point of MHSM |
| Backup/restore, full key ops | much lower | Per vault | Bulk ops can self-throttle |
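
To see how quickly an uncached app reaches a cap like these, the arithmetic is worth writing down. The sketch below is a back-of-envelope model, not a capacity planner: the function names are hypothetical, the cap is passed in as a parameter so you can substitute the current documented limit, and the instance count and refresh interval are illustrative.

```python
def uncached_transactions_per_10s(requests_per_sec: float, secrets_per_request: int) -> float:
    """Vault transactions generated in a 10 s window with no caching."""
    return requests_per_sec * secrets_per_request * 10


def max_uncached_rps(cap_per_10s: float, secrets_per_request: int) -> float:
    """Highest request rate that stays under the cap without caching."""
    return cap_per_10s / (secrets_per_request * 10)


def cached_transactions_per_10s(instances: int, secrets: int, refresh_seconds: float) -> float:
    """Transactions per 10 s when each instance refreshes each secret on a timer."""
    return instances * secrets * 10 / refresh_seconds


CAP = 4000  # illustrative cap per 10 s; use the current documented limit
print(max_uncached_rps(CAP, secrets_per_request=4))                 # 100.0
print(cached_transactions_per_10s(instances=3, secrets=4, refresh_seconds=300))  # 0.4
```

The point of the comparison: uncached load scales with request rate, while cached load scales only with instance count and refresh interval, so traffic spikes stop mattering.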

The three reading notes that save the most time:

| Distinction | The trap | How to tell them apart |
| --- | --- | --- |
| 403 AccessDenied vs ForbiddenByFirewall | Both are “403” but fixes are opposite | The error body names it: AccessDenied = grant a role; ForbiddenByFirewall = network ACL |
| RBAC vault vs access-policy vault | Assigning a role on an access-policy vault does nothing | enableRbacAuthorization true → use roles; false → use set-policy |
| 429 from your app vs from the platform | Looks like a Key Vault outage | Non-zero throttle metric + Retry-After → you’re over the cap; cache, don’t blame the service |
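
Honouring Retry-After is the other half of handling a 429. The Python below sketches the retry logic only; Azure SDK clients generally ship with their own retry policy, so treat this as an illustration of the behaviour rather than something to reimplement. `operation` is a hypothetical callable assumed to return a `(status, retry_after, body)` tuple, and the base and cap values are arbitrary.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next try (attempt is 0-based)."""
    if retry_after is not None:
        return float(retry_after)            # the service said how long to wait
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)        # exponential backoff with full jitter


def call_with_retry(operation, max_attempts=5, sleep=time.sleep):
    """Call operation(), retrying only on 429, up to max_attempts tries."""
    for attempt in range(max_attempts):
        status, retry_after, body = operation()
        if status != 429:
            return status, body
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt, retry_after))
    return status, body                      # still throttled after all tries


# Usage with a fake operation: throttled twice, then succeeds.
responses = iter([(429, "2", None), (429, None, None), (200, None, "ok")])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
print(result)
```

Backoff only smooths bursts. If the throttle metric stays non-zero under steady load, caching is the fix, not longer waits.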

Architecture at a glance

The diagram traces a secret read exactly as it happens on the wire, then maps the failure classes onto the hops where they bite. Read it left to right. On the far left, an App Service app holds a managed identity (badge 1: if that identity is missing, no token is ever issued and the whole path is dead). The app asks the platform for a token; the request reaches Entra ID, which issues a short-lived Bearer JWT scoped to https://vault.azure.net. The app presents that token to the Key Vault data plane, but first it must clear two gates: the vault firewall (badge 2: if public access is disabled and the caller isn’t on the Private Endpoint or an allow-listed network, it’s ForbiddenByFirewall) and the RBAC/access-policy check (badge 3: the identity needs Key Vault Secrets User at the vault scope, or it’s AccessDenied). Only then does the data plane (badge 4: watch the per-vault transaction cap from the limits table above; bursts beyond it return 429) reach into the backing objects: the secret, the HSM-backed key, or the certificate.

The right edge shows the lifecycle that keeps it all current: Event Grid raises a NearExpiry event, a Function rotates the credential in its backing service and writes a new version back into the vault (badge 5 — if rotation was never wired, a cert simply expires and TLS goes down). Notice that every successful read converges on the same three facts you confirm during an incident: does the caller have an identity, can it pass the firewall, and does it hold a data-plane role? That ordering — identity, then network, then authorization, then throttle — is the whole diagnostic method. The first question on any Key Vault failure is “is this a 401 (no identity), a 403-firewall (network), a 403-RBAC (no role), or a 429 (throttle)?” — and the diagram tells you which hop owns each.
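
That first question can be written as a small triage function. The Python below is a sketch of the diagnostic ordering, not an official error taxonomy: `diagnose` is a hypothetical helper, and it assumes you have already extracted the HTTP status and the inner error code string from the error body.

```python
def diagnose(status, error_code=""):
    """Map a Key Vault failure to the hop that owns it and the first thing to check."""
    code = (error_code or "").lower()
    if status == 401:
        return ("identity", "enable a managed identity; check the token audience")
    if status == 403 and code == "forbiddenbyfirewall":
        return ("network", "allow the caller's IP/subnet or fix Private Endpoint DNS")
    if status == 403:
        return ("authorization", "assign a data-plane role at the vault scope")
    if status == 404:
        return ("object", "check the name/URI; look in list-deleted")
    if status == 429:
        return ("throttle", "cache in-process and back off")
    if status in (500, 503):
        return ("platform", "retry with backoff; check Service Health")
    return ("unknown", "read the full error body")


print(diagnose(403, "ForbiddenByFirewall")[0])  # network
print(diagnose(403, "AccessDenied")[0])         # authorization
```

The two 403 branches are the reason to read the error body before touching anything: one is fixed on the network blade, the other on the IAM blade.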

Figure: Azure Key Vault architecture and failure map. An App Service app with a managed identity requests a token from Entra ID, receives a Bearer JWT, and calls the Key Vault data plane through the vault firewall (Private Endpoint, deny public) and the RBAC/access-policy authorization gate to read a secret, an HSM-backed key, or a certificate; on the right, Event Grid raises near-expiry events that trigger a Function to rotate credentials and write new versions back. Numbered badges mark the five failure points (missing managed identity, firewall block, missing data-plane role, throttling at the transaction cap, and unwired rotation), each narrated in the legend as symptom, confirm command, and fix.

Real-world scenario

Medivault Health runs a patient-portal API on Azure App Service (Linux, .NET 8) on a P1v3 plan in Central India, with Azure SQL behind it, all under HIPAA-style controls. The platform team is five engineers; the original design stored the SQL connection string and a third-party lab-results API key as plaintext App Service settings, and terminated TLS with a .pfx an engineer renewed by hand each year. Monthly spend was about ₹42,000. Three separate incidents in one quarter forced a redesign, and the redesign was Key Vault, done properly.

The first incident was a near-miss audit finding: the plaintext connection string in app settings, visible to anyone with portal access, failed the access-control review outright. The auditor’s question — “prove who has read this credential and when” — had no answer. The team moved both the connection string and the lab API key into a vault, kv-medivault-prod in RBAC mode with purge protection on, and switched the app to system-assigned managed identity + Key Vault references. The connection string in app settings became @Microsoft.KeyVault(SecretUri=.../secrets/SqlConn/). Reads were now logged, attributable, and rotatable.

The rollout broke in a way that taught the core lesson. On first deploy, the app crash-looped because the SQL connection string resolved to empty. The reflex was to suspect the vault, the network, the secret. The actual cause: the app’s managed identity existed, but the Key Vault Secrets User role had been assigned at resource-group scope, on a resource group that no longer contained the vault (the vault had since been moved to a different RG), so the assignment did not apply at the vault’s actual scope. az role assignment list --assignee <principal> --scope <vaultId> returned nothing for the vault. Re-assigning the role at the vault scope fixed it instantly. The lesson on the wall: a Key Vault reference failing empty is almost always identity-or-role, not the secret.

The second incident was the certificate. The hand-renewed .pfx lapsed because the engineer who tracked it was on leave; the portal threw cert-expiry warnings nobody was watching, and the custom domain went to a browser TLS error for forty minutes during business hours. The fix was to move the certificate into the vault as a managed certificate with auto-renewal (AutoRenew 30 days before expiry), wired to an integrated CA, and to subscribe Event Grid CertificateNearExpiry and CertificateNewVersionCreated events to a Function that re-imported the new cert to the front-end and posted to the team’s Teams channel. The 2am-expiry class of incident was now structurally impossible.

The third was throttling. Under a reporting spike, the API, which read four secrets on every request with no caching, hit the per-vault GET cap and started getting 429s, which surfaced as request failures. The ServiceApiResult metric showed throttled transactions climbing exactly with load. The fix was not a bigger SKU: it was caching the resolved secrets in-process (refreshed every few minutes, plus on SecretNewVersionCreated) so a request made zero Key Vault calls in the hot path. Transaction volume fell ~98%, the 429s vanished, and the architecture was cheaper to run because Key Vault operations are billed per transaction. Final state: kv-medivault-prod with RBAC, purge protection, a Private Endpoint (public access disabled), managed-identity references, auto-renewing certs, event-driven rotation, and in-process caching. Spend was flat at ₹42,000; the audit passed; the pager went quiet.

The incident timeline, because the order of moves is the lesson:

| When | Symptom | Action taken | Effect | What it should have been |
| --- | --- | --- | --- | --- |
| Q1 audit | Plaintext secret fails review | Move secrets to vault + MI references | Logged, attributable, rotatable | Never store plaintext in the first place |
| First deploy | App crash-loops, SqlConn empty | Suspect vault/network | Wasted an hour | Check identity + role at vault scope first |
| +20 min | Still empty | az role assignment list --scope <vaultId> = none | Root cause: role at wrong scope | Assign data roles at vault scope |
| Cert lapse | TLS error 40 min, business hours | Emergency manual renew | Outage over, root cause remains | Managed cert + auto-renew + events |
| +1 week | Cert hardened | Auto-renew + Event Grid → Function | 2am-expiry class eliminated | Should have been day-one design |
| Reporting spike | 429s, request failures | Suspect Key Vault outage | Misdirected | Read ServiceApiResult; it’s your throttle |
| +2 days | Throttling fixed | Cache secrets in-process | −98% transactions, 429s gone, cheaper | Never read secrets per-request uncached |
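
The caching fix in the last row is small. The Python below is a minimal sketch of the idea, not the Medivault team's code (their app was .NET): `SecretCache` is a hypothetical class, `fetch` stands in for whatever call actually reads the secret from the vault, and the five-minute TTL is illustrative. It is not thread-safe as written.

```python
import time


class SecretCache:
    """In-process TTL cache so the hot path makes no Key Vault calls."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # callable: secret name -> value (hits the vault)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._items = {}             # name -> (value, fetched_at)

    def get(self, name):
        now = self._clock()
        hit = self._items.get(name)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]            # fresh enough: no vault transaction
        value = self._fetch(name)    # miss or stale: one vault transaction
        self._items[name] = (value, now)
        return value

    def invalidate(self, name=None):
        """Drop one secret (or all), e.g. on a SecretNewVersionCreated event."""
        if name is None:
            self._items.clear()
        else:
            self._items.pop(name, None)


# Usage with a fake fetch that counts vault round-trips.
vault_calls = []
cache = SecretCache(lambda n: vault_calls.append(n) or f"value-of-{n}")
for _ in range(1000):
    cache.get("SqlConn")
print(len(vault_calls))  # 1
```

A thousand reads cost one vault transaction, which is the shape of the ~98% reduction described above. Calling `invalidate` from the event handler keeps rotation latency short without shrinking the TTL.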

Advantages and disadvantages

Centralizing secrets, keys and certificates in a managed, throttled, access-controlled service is overwhelmingly the right call — but it introduces a runtime dependency and a few sharp edges you must design around. Weigh it honestly:

| Advantages (why this model helps) | Disadvantages (why it bites) |
| --- | --- |
| Secrets leave config and source control; an app holds a reference, not the value | Adds a runtime dependency: the vault must be reachable and authorized at startup |
| Every read is logged and identity-attributed, so audits become answerable | Misconfigured identity/role makes the app crash-loop with empty values (looks like a random failure) |
| Managed identity means no stored credential to access your secrets | The firewall can block your own callers if you lock down without a path/DNS |
| Rotation becomes a single operation; certs auto-renew | Certificate consumers (App Gateway/Front Door) may need a re-import after renewal |
| Soft-delete + purge protection defend against accidental and malicious deletion | Purge protection is irreversible, and soft-delete reserves the name (redeploy gotcha) |
| HSM/Managed HSM offer FIPS-validated custody when compliance demands it | A shared, throttled service: uncached per-request reads hit 429 under load |
| RBAC gives central governance, inheritance, and PIM | Two auth models (RBAC vs access policies) confuse teams; control access ≠ data access |
| One vault per env/sensitivity cleanly scopes blast radius | Per-transaction billing means chatty access costs money as well as throttles |

The model is right for essentially every workload that handles secrets — which is all of them. It bites hardest on teams who lock down a vault without testing their own callers’ path, who read secrets per-request without caching, who forget that Key Vault Contributor is not a data reader, and who never wire rotation and then get paged by an expiry. Every disadvantage is manageable — caching defeats throttling, vault-scope role assignment defeats the crash-loop, a Private Endpoint with DNS defeats the lockout — but only if you know they exist, which is the entire point of this article.

Hands-on lab

Stand up a vault, store a secret, grant an app’s managed identity read access, wire a Key Vault reference, and confirm an unauthorized caller is denied — all free-tier-friendly (a vault costs per transaction, effectively pennies; delete at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-kv-lab
LOC=centralindia
KV=kv-lab-$RANDOM          # globally-unique vault name
APP=app-kv-lab-$RANDOM     # globally-unique app name
az group create -n $RG -l $LOC -o table

Step 2 — Create a vault in RBAC mode with soft-delete (and purge protection off, so you can delete it cleanly).

az keyvault create -n $KV -g $RG -l $LOC \
  --enable-rbac-authorization true \
  --retention-days 7 \
  --sku standard -o table

Expected: a vault row; enableRbacAuthorization true. (We leave purge protection off only because this is a throwaway lab — in production, turn it on.)

Step 3 — Grant yourself a data role, then store a secret. Because the vault is RBAC, even as the creator you need a data role to write a secret:

ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee "$ME" --role "Key Vault Secrets Officer" \
  --scope "$(az keyvault show -n $KV -g $RG --query id -o tsv)"

# Give RBAC ~30s to propagate, then set a secret
sleep 30
az keyvault secret set --vault-name $KV --name DemoSecret --value "hello-from-kv" -o table

Expected: the secret object, id ending /secrets/DemoSecret/<version>. If you get 403, the role hasn’t propagated — wait and retry. (This is the “control access ≠ data access” lesson, live.)

Step 4 — Create an app with a managed identity.

az appservice plan create -n plan-kv-lab -g $RG --is-linux --sku B1 -o table
az webapp create -n $APP -g $RG -p plan-kv-lab --runtime "DOTNETCORE:8.0" -o table
az webapp identity assign -n $APP -g $RG -o table

Step 5 — Grant the app’s identity read-only access and wire a Key Vault reference.

PRINCIPAL=$(az webapp identity show -n $APP -g $RG --query principalId -o tsv)
az role assignment create --assignee "$PRINCIPAL" --role "Key Vault Secrets User" \
  --scope "$(az keyvault show -n $KV -g $RG --query id -o tsv)"

SECRET_URI=$(az keyvault secret show --vault-name $KV --name DemoSecret --query id -o tsv)
# Strip the version to follow rotation (unversioned reference)
BASE_URI=$(echo "$SECRET_URI" | sed 's#/[^/]*$#/#')
az webapp config appsettings set -n $APP -g $RG \
  --settings "DemoSecret=@Microsoft.KeyVault(SecretUri=$BASE_URI)" -o table
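
The sed expression above simply drops the last path segment of the secret id. If you build references in application or deployment code instead, the same step looks like this in Python. It is a sketch under stated assumptions: `unversioned_secret_uri` and `key_vault_reference` are hypothetical helpers, and the vault name and version string are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def unversioned_secret_uri(secret_id: str) -> str:
    """Strip the version segment so the reference follows rotation."""
    parts = urlsplit(secret_id)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "secrets":
        raise ValueError("not a Key Vault secret id: " + secret_id)
    # Keep /secrets/<name>/ and drop any trailing version.
    path = "/secrets/" + segments[1] + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def key_vault_reference(secret_id: str) -> str:
    """Build the app-setting value used for a Key Vault reference."""
    return "@Microsoft.KeyVault(SecretUri=" + unversioned_secret_uri(secret_id) + ")"


sid = "https://kv-lab-12345.vault.azure.net/secrets/DemoSecret/0123456789abcdef"  # illustrative
print(key_vault_reference(sid))
# @Microsoft.KeyVault(SecretUri=https://kv-lab-12345.vault.azure.net/secrets/DemoSecret/)
```

Pinning the version instead (keeping the full id) is a valid choice when you want a deploy to control exactly which value an app sees, at the cost of the reference no longer following rotation.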

Step 6 — Confirm the reference resolved (not empty). In the portal: the app’s Environment variables blade shows DemoSecret with a green “resolved” status (an error icon means identity/role/firewall — exactly the failure table above). Via CLI you can verify the setting is the reference:

az webapp config appsettings list -n $APP -g $RG \
  --query "[?name=='DemoSecret'].{name:name, value:value}" -o table

Step 7 — Prove unauthorized access is denied. Remove the app’s role and confirm a read would now fail (the reference would resolve empty on next refresh):

az role assignment delete --assignee "$PRINCIPAL" --role "Key Vault Secrets User" \
  --scope "$(az keyvault show -n $KV -g $RG --query id -o tsv)"
# Re-add it so the app keeps working (or leave removed to observe the crash-loop)
az role assignment create --assignee "$PRINCIPAL" --role "Key Vault Secrets User" \
  --scope "$(az keyvault show -n $KV -g $RG --query id -o tsv)"

Validation checklist. You created an RBAC vault, learned that creating it doesn’t grant data access, stored and read a secret, gave an app a credential-free identity, wired a Key Vault reference, and saw that removing the role is what breaks it. The steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
| --- | --- | --- | --- |
| 2 | RBAC vault, soft-delete | The secure default posture | Every production vault |
| 3 | Assign a data role to yourself | Control access ≠ data access | The #1 “why 403” confusion |
| 5 | MI + Secrets User + KV reference | Secrets with zero stored creds | The canonical app pattern |
| 6 | Check the reference resolved | The reference-status diagnostic | First look when a setting “won’t take” |
| 7 | Remove the role | Role-or-identity is what breaks references | The empty-value crash-loop, live |

Cleanup (avoid lingering charges and free the vault name).

# Wait for the delete to finish (no --no-wait) so the vault is soft-deleted before you purge it
az group delete -n $RG --yes
# Because soft-delete reserves the name, purge it if you want the name back immediately:
az keyvault purge -n $KV  # only works with purge protection OFF (as in this lab)

Cost note. A B1 plan is a few rupees per hour and Key Vault transactions are fractions of a paisa each — an hour of this lab is well under ₹50, and deleting the resource group stops everything. Remember az keyvault purge is required to fully release the name (soft-delete keeps it reserved otherwise).

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest with full confirm-command detail.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | App crash-loops; secret-backed setting resolves empty | App has no managed identity | az webapp identity show -n <app> -g <rg> (empty) | az webapp identity assign; then grant the role |
| 2 | 403 AccessDenied reading a secret, identity exists | No data-plane role assigned (or wrong scope) | az role assignment list --assignee <principal> --scope <vaultId> | Assign Key Vault Secrets User at the vault scope |
| 3 | 403 ForbiddenByFirewall from your own app | Firewall default-deny, caller not allow-listed; or public disabled, no PE | az keyvault show --query properties.networkAcls; publicNetworkAccess | Allow subnet / bypass AzureServices / add Private Endpoint + DNS |
| 4 | “Key Vault Contributor” still can’t read secrets | Confusing control plane with data plane | IAM blade: they have Contributor, no data role | Add a data-plane role (Secrets User/Officer) |
| 5 | Assigning an access policy “does nothing” | Vault is in RBAC mode (enableRbacAuthorization) | az keyvault show --query properties.enableRbacAuthorization | Use az role assignment (not set-policy) on RBAC vaults |
| 6 | Intermittent failures / 429 under load | Throttling: reading secrets per-request, uncached | ServiceApiResult metric throttled > 0; Retry-After header | Cache in-process; exponential backoff; split vaults |
| 7 | Rotated secret not picked up by the app | KV reference cached, or versioned URI pinned | App restarts pick it up; check URI has a version | Use unversioned URI; restart/refresh; handle NewVersionCreated |
| 8 | TLS broke when the cert “renewed” | App Gateway/Front Door still serving old cert | Compare served thumbprint to vault current version | Re-import cert (or MI-sync) on CertificateNewVersionCreated |
| 9 | Certificate silently expired | No auto-renewal lifetime action wired | az keyvault certificate show --query policy.lifetimeActions | Add AutoRenew lifetime action; subscribe NearExpiry events |
| 10 | VaultAlreadyExists / can’t recreate a vault | A same-named vault is soft-deleted (name reserved) | az keyvault list-deleted | az keyvault recover, or purge (if PP off), or rename |
| 11 | Can’t disable purge protection | Purge protection is irreversible | az keyvault show --query properties.enablePurgeProtection = true | Cannot disable; wait out retention or recreate the vault |
| 12 | Read returns success but value is wrong/empty | Secret disabled, expired (advisory), or wrong version | az keyvault secret show --query "{en:attributes.enabled, exp:attributes.exp}" | Enable / roll forward / fix the URI |
| 13 | App Configuration KV reference unresolved | App Config’s identity lacks Secrets User | App Config “Key Vault reference” error status | Grant App Config’s managed identity Secrets User on the vault |
| 14 | Crypto op fails on a key (encrypt/sign) | Key lacks that operation in key_ops, or wrong algorithm | az keyvault key show --query key.keyOps | Grant the op / pick a supported algorithm; Crypto User role |

The expanded form, with full reasoning for the entries that bite hardest:

1. App crash-loops and a secret-backed setting resolves to empty. Root cause: The app has no managed identity, so no Entra token is issued and the Key Vault reference resolves to nothing. Confirm: az webapp identity show -n <app> -g <rg> returns empty/null; the portal Environment variables blade shows the reference with a red error. Fix: az webapp identity assign (system-assigned) or attach a user-assigned identity, then grant it the data role (mistake #2 is the very next step people forget).

2. 403 AccessDenied reading a secret even though the identity exists. Root cause: The identity has no data-plane role, or the role was assigned at the wrong scope (e.g. an RG that no longer contains the vault, as in the Medivault story). Confirm: az role assignment list --assignee <principal> --scope $(az keyvault show -n <kv> -g <rg> --query id -o tsv) returns nothing. Fix: az role assignment create --assignee <principal> --role "Key Vault Secrets User" --scope <vaultId> — assign at the vault scope (or object scope for finer control).

3. 403 ForbiddenByFirewall from your own application. Root cause: The vault firewall is default-deny and the caller isn’t allow-listed, or public access is disabled with no Private Endpoint/DNS for the caller. Confirm: az keyvault show -n <kv> -g <rg> --query "{acls:properties.networkAcls, pna:properties.publicNetworkAccess}"; the error body says ForbiddenByFirewall (not AccessDenied). Fix: Add the caller’s subnet/IP, set bypass AzureServices for first-party callers, or (the strong form) add a Private Endpoint with a Private DNS zone so the hostname resolves privately.

4. Someone with Key Vault Contributor still can’t read a secret. Root cause: Control plane ≠ data plane. Contributor manages the vault but grants no access to the objects inside. Confirm: IAM blade shows Contributor but no Secrets/Crypto/Certificates data role. Fix: Assign the appropriate data-plane role. Management access never implies data access — by design.

5. Adding an access policy has no effect. Root cause: The vault is in Azure RBAC mode (enableRbacAuthorization = true), so the access-policy list is ignored. Confirm: az keyvault show --query properties.enableRbacAuthorization returns true. Fix: Use az role assignment create instead of az keyvault set-policy. (Pick one model per vault and stick to it.)

6. Intermittent failures and 429s under load. Root cause: Throttling — the app reads secrets on every request without caching and exceeds the per-vault transaction cap. Confirm: The ServiceApiResult metric shows throttled results climbing with load; responses carry a Retry-After header. Fix: Cache the resolved secrets in-process (refresh on an interval and on SecretNewVersionCreated); add exponential backoff; for genuinely high volume, split across vaults. A bigger SKU does not fix this.

7. A rotated secret isn’t picked up. Root cause: The Key Vault reference is cached, or you referenced a versioned URI that pins an old version. Confirm: The reference URI ends in a version GUID; restarting the app picks up the new value. Fix: Use the unversioned URI to follow rotation; restart/refresh the consumer; handle SecretNewVersionCreated to refresh caches deliberately.

8. TLS broke right after a certificate “renewed.” Root cause: The renewal created a new version in the vault, but the consumer (Application Gateway, Front Door) is still serving the old cert because it wasn’t re-imported/synced. Confirm: Compare the thumbprint the endpoint serves against the vault’s current certificate version. Fix: Re-import the cert to the consumer (or rely on managed-identity cert integration), triggered by the CertificateNewVersionCreated event.

9. A certificate silently expired. Root cause: No auto-renewal lifetime action was configured (or it was EmailContacts, which only warns). Confirm: az keyvault certificate show --query policy.lifetimeActions shows no AutoRenew trigger. Fix: Add an AutoRenew lifetime action (e.g. 30 days before expiry) and subscribe CertificateNearExpiry/CertificateNewVersionCreated events so renewal is verified and propagated.

10. You can’t recreate a vault — VaultAlreadyExists. Root cause: A previously-deleted, same-named vault is soft-deleted and still holding the globally-unique name. Confirm: az keyvault list-deleted shows it with a scheduledPurgeDate. Fix: az keyvault recover -n <name> to bring it back (and let IaC adopt it), or az keyvault purge if purge protection is off and policy allows, or choose a different name.

11. You can’t turn off purge protection. Root cause: Purge protection is irreversible by design. Confirm: az keyvault show --query properties.enablePurgeProtection is true. Fix: There is none for the existing vault — wait out retention for soft-deleted objects, or stand up a new vault if you genuinely need a no-PP vault (rare; PP is the safer default).
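The fix in entry 6 (cache in-process, back off on 429) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Key Vault SDK: `fetch` is a hypothetical callable standing in for whatever actually reads the secret, `Throttled` is a made-up exception that carries the Retry-After value, and the TTL, attempt count and base delay are arbitrary starting points.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple


class Throttled(Exception):
    """Hypothetical stand-in for a 429 response; carries Retry-After seconds if present."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("throttled (429)")
        self.retry_after = retry_after


class CachedSecrets:
    """Read each secret at most once per TTL window; retry throttled reads with backoff."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0,
                 max_attempts: int = 5, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (value, expires_at)

    def get(self, name: str) -> str:
        hit = self._cache.get(name)
        if hit is not None and hit[1] > self._clock():
            return hit[0]  # still fresh: no call to the vault
        value = self._fetch_with_backoff(name)
        self._cache[name] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, name: str) -> None:
        """Call from a SecretNewVersionCreated handler to force a re-read."""
        self._cache.pop(name, None)

    def _fetch_with_backoff(self, name: str) -> str:
        for attempt in range(self._max_attempts):
            try:
                return self._fetch(name)
            except Throttled as exc:
                if attempt == self._max_attempts - 1:
                    raise  # out of attempts: surface the throttle to the caller
                # Prefer the server's Retry-After; otherwise exponential backoff with jitter.
                delay = exc.retry_after
                if delay is None:
                    delay = self._base_delay * (2 ** attempt) + random.uniform(0, self._base_delay)
                self._sleep(delay)
        raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `clock` parameters exist only so the behaviour can be exercised without real waiting; in an app you would leave them at their defaults and wire `invalidate` to whatever receives your rotation events.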

Best practices

The alerts worth wiring before the next incident — leading indicators, not the lagging “app down”:

| Alert on | Signal / metric | Threshold (starting point) | Why it’s leading |
| --- | --- | --- | --- |
| Throttling | ServiceApiResult (throttled) | > 0 sustained 5 min | First sign of uncached per-request reads before 429s cascade |
| Cert near-expiry | CertificateNearExpiry event / days-to-expiry | < 30 days | Catches a renewal that didn’t fire before TLS breaks |
| Secret near-expiry | SecretNearExpiry event | < 14 days | Rotate before consumers fail on a stale credential |
| Unauthorized access | 403 result count | spike above baseline | Misconfig or an actual access attempt |
| Availability | Vault availability metric | < 99.9% | Platform issue vs your config: rule it in/out fast |
| Saturation toward cap | Total transactions / 10 s | approaching the GET cap | You’re about to throttle; add caching now |
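As a rough illustration of how those starting thresholds combine, the sketch below evaluates them against one snapshot of a vault's signals. The `VaultSnapshot` type and its field names are hypothetical, the 80% cut-off for "approaching the cap" is an assumption (the table does not quantify it), and the 403 spike alert is left out because it needs a baseline that a single snapshot cannot supply.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VaultSnapshot:
    """Hypothetical roll-up of signals for one vault; field names are illustrative."""
    throttled_results_5min: int    # throttled ServiceApiResult count over the last 5 minutes
    cert_days_to_expiry: int       # nearest certificate expiry, in days
    secret_days_to_expiry: int     # nearest secret expiry, in days
    availability_pct: float        # vault availability metric
    transactions_per_10s: int      # current transaction rate
    transaction_cap_per_10s: int   # the cap you measure against (supply your own figure)


def leading_alerts(s: VaultSnapshot, saturation_ratio: float = 0.8) -> List[str]:
    """Return the names of the table's alerts that would fire for this snapshot."""
    fired = []
    if s.throttled_results_5min > 0:
        fired.append("Throttling")
    if s.cert_days_to_expiry < 30:
        fired.append("Cert near-expiry")
    if s.secret_days_to_expiry < 14:
        fired.append("Secret near-expiry")
    if s.availability_pct < 99.9:
        fired.append("Availability")
    # Assumed cut-off: treat 80% of the cap as "approaching".
    if s.transactions_per_10s >= saturation_ratio * s.transaction_cap_per_10s:
        fired.append("Saturation toward cap")
    return fired
```

In practice these conditions would live in Azure Monitor alert rules rather than application code; the function only makes the thresholds concrete.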

Security notes

The security knobs that also prevent incidents — secure and resilient pull the same direction here:

| Control | Setting / mechanism | Secures against | Also prevents |
| --- | --- | --- | --- |
| Managed identity + KV references | identity + @Microsoft.KeyVault(...) | Plaintext secrets in config | Hand-rolled credentials drifting/leaking |
| Azure RBAC, least privilege | Key Vault Secrets User at vault scope | Over-broad access to secrets | Officer-role mistakes; lateral movement |
| Soft-delete + purge protection | enableSoftDelete, enablePurgeProtection | Malicious/accidental deletion | Painful unrecoverable loss; redeploy-after-delete |
| Private Endpoint + public disabled | publicNetworkAccess: 'Disabled' + PE | Internet-exposed secrets | Some firewall lockout classes (with DNS done right) |
| Diagnostic logs to Log Analytics | Vault diagnostic settings | Unauditable access | Slow incident triage |
| HSM / Managed HSM | Premium / Managed HSM | Key material in software | Failed compliance audits |
| Policy: enforce vault standards | Azure Policy (deny/audit) | Drifting, insecure vaults | One team’s mistake becoming estate-wide |
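The first control depends on the `@Microsoft.KeyVault(...)` reference following rotation, which it only does when the secret URI is unversioned. A small sketch of checking for a pinned version and building an unversioned reference might look like this. The helper names are hypothetical and `contoso-kv` in the usage note is a placeholder vault name; only the `@Microsoft.KeyVault(SecretUri=...)` form and the `/secrets/<name>[/<version>]` path shape are taken as given.

```python
from typing import Optional
from urllib.parse import urlparse


def secret_version(secret_uri: str) -> Optional[str]:
    """Return the pinned version segment of a secret URI, or None if it is unversioned."""
    parts = [p for p in urlparse(secret_uri).path.split("/") if p]
    # Expected path shape: /secrets/<name>[/<version>]
    if len(parts) < 2 or parts[0] != "secrets":
        raise ValueError(f"not a secret URI: {secret_uri!r}")
    return parts[2] if len(parts) > 2 else None


def key_vault_reference(secret_uri: str) -> str:
    """Build an app-setting value that follows rotation by dropping any pinned version."""
    secret_version(secret_uri)  # validates the path shape; raises on anything else
    parsed = urlparse(secret_uri)
    name = [p for p in parsed.path.split("/") if p][1]
    unversioned = f"{parsed.scheme}://{parsed.netloc}/secrets/{name}/"
    return f"@Microsoft.KeyVault(SecretUri={unversioned})"
```

For example, `key_vault_reference("https://contoso-kv.vault.azure.net/secrets/DbPassword/<version>")` drops the version segment and returns a reference to `https://contoso-kv.vault.azure.net/secrets/DbPassword/`. Keep the version only when you deliberately want a fixed, reviewed value.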

Cost & sizing

The bill drivers and how they interact with the design:

A rough monthly picture: a typical app’s Key Vault footprint (a Standard vault, a handful of secrets, a managed cert, sane caching) is often ₹0–200/month — operations are that cheap when you cache. Add a Private Endpoint (~₹600–900/month) for production isolation. Premium adds per-HSM-key charges; Managed HSM is a different order of magnitude (hourly per pool — for estates with real compliance throughput, not single apps). Medivault’s vault cost stayed in the low hundreds of rupees even after Private Endpoint, because caching cut transactions ~98%. The cost drivers and what each buys you:

| Cost driver | What you pay for | Rough INR / month | What it fixes / enables | Watch-out |
| --- | --- | --- | --- | --- |
| Standard vault operations | Per-transaction secret/key/cert ops | ~₹0–200 (with caching) | The base service | Uncached per-request reads → bill + 429 |
| Private Endpoint | Hourly + per-GB | ~₹600–900 | Vault off the public internet | Needs VNet + Private DNS |
| Premium HSM keys | Per HSM-key/month + ops | varies per key | FIPS 140-2 L2 custody | Surcharge per key; only with a mandate |
| Managed HSM | Fixed hourly per HSM pool | high (enterprise) | L3, single-tenant, high throughput | Not for a single app’s secrets |
| Diagnostic logs | Per-GB ingested to Log Analytics | ~₹100–500 | Audit trail / alerting | Volume tracks (uncached) read volume |
| Certificate renewals | Per renewal/op (+ CA cost separately) | low | Auto-renewing TLS | Integrated-CA cost is the CA’s |

The sizing rule in one line: right-size by transaction volume and custody requirement, not by SKU reflex. Cache to kill volume; choose Standard unless an auditor names a FIPS level; add a Private Endpoint for production; reserve Managed HSM for genuine enterprise crypto throughput.
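The "cache to kill volume" part of that rule is easiest to see as back-of-envelope arithmetic. Every input below is hypothetical (request rate, secrets per request, instance count, refresh interval), so treat the result as an illustration of the shape of the saving, not a measurement.

```python
def vault_reads_per_10s_uncached(requests_per_second: float, secrets_per_request: int) -> float:
    """Every request reads every secret it needs straight from the vault."""
    return requests_per_second * secrets_per_request * 10


def vault_reads_per_10s_cached(instances: int, secrets: int, refresh_seconds: float) -> float:
    """Each instance re-reads each secret once per refresh interval, regardless of traffic."""
    return instances * secrets * (10 / refresh_seconds)


def reduction_pct(uncached: float, cached: float) -> float:
    """Percentage of vault reads removed by caching."""
    return 100 * (1 - cached / uncached)


# Hypothetical workload: 50 req/s, 3 secrets per request, 4 instances, 5-minute refresh.
uncached = vault_reads_per_10s_uncached(50, 3)   # 50 * 3 * 10 = 1500 reads per 10 s
cached = vault_reads_per_10s_cached(4, 3, 300)   # 4 * 3 * (10 / 300), about 0.4 reads per 10 s
print(f"uncached={uncached:.0f}/10s cached={cached:.1f}/10s "
      f"reduction={reduction_pct(uncached, cached):.2f}%")
```

The point of the model is that cached volume scales with instance count and refresh interval, not with traffic, which is why caching also shrinks the diagnostic-log line in the table.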

Interview & exam questions

1. What is the difference between the control plane and the data plane in Key Vault, and why does it trip people up? The control plane (Azure Resource Manager) manages the vault as a resource — create/delete, firewall, SKU, configure RBAC mode — governed by roles like Key Vault Contributor. The data plane governs the objects inside — get a secret, sign with a key — governed by data roles like Key Vault Secrets User or access policies. It trips people up because Contributor can manage the vault but cannot read a secret; management access is not data access.

2. An app’s Key Vault reference resolves to an empty value and the app crash-loops. What are the two most likely causes? Either the app has no managed identity (so no token is issued — check az webapp identity show), or the identity exists but has no data-plane role (or it’s assigned at the wrong scope — check az role assignment list --scope <vaultId>). Fix by enabling the identity and assigning Key Vault Secrets User at the vault scope.

3. When would you choose Azure RBAC over the access-policy model? Essentially always for new vaults: RBAC gives inheritance (MG→sub→RG→vault→object), central governance via az role assignment, just-in-time elevation through PIM, and object-level scope. Access policies are a flat per-vault list capped at 1024 entries with no PIM. Each vault uses exactly one model, set via enableRbacAuthorization.

4. What do soft-delete and purge protection do, and what’s the catch with purge protection? Soft-delete (always on now) keeps a deleted vault/object recoverable for a 7–90 day retention window. Purge protection blocks anyone from permanently purging during that window — defeating a malicious “delete and purge everything.” The catch: purge protection is irreversible once enabled, and it makes the retention period a hard floor.

5. Why does a TLS certificate appear as three objects in Key Vault? A certificate object bundles the X.509 cert, its private key (stored as a Key Vault key), and the exportable PFX/PEM (stored as a Key Vault secret) — so the same name is addressable as a certificate, a key, and a secret. You use the certificate object for lifecycle/renewal, the key for operations without exporting, and the secret to import the full PFX into App Service or Application Gateway.

6. An app intermittently gets 429 from Key Vault under load. What’s happening and how do you fix it? Key Vault is a throttled, shared service with a per-vault transaction cap (~25,000 fast transactions / 10 s). An app reading secrets per request without caching exceeds it under load and gets 429 with a Retry-After. The fix is in-process caching (refresh on an interval and on SecretNewVersionCreated) plus exponential backoff — not a bigger SKU.

7. You can’t recreate a vault — VaultAlreadyExists — but you don’t see it in the portal. Why? A previously-deleted, same-named vault is soft-deleted and still reserving the globally-unique name. Confirm with az keyvault list-deleted. Recover it (az keyvault recover) and let IaC adopt it, purge it (if purge protection is off and policy permits), or pick a new name.

8. What’s the difference between Standard, Premium, and Managed HSM? Standard stores software-protected keys (and is fine for most secrets/certs). Premium adds HSM-protected keys on shared FIPS 140-2 Level 2 HSMs. Managed HSM is a single-tenant pool of FIPS 140-2 Level 3 HSMs with its own RBAC and much higher throughput, billed at a fixed hourly rate per pool. Choose by required assurance level and isolation, not by feature envy.

9. How do you make a database password rotate automatically with Key Vault? A rotation policy sets an expiry; a SecretNearExpiry event via Event Grid triggers a Function that rotates the credential in the database and writes the new value back as a new secret version. Consumers using the unversioned reference pick up the new version (on restart/refresh). Key Vault alone can’t change the backing system — the Function does that half.

10. You locked a vault to a Private Endpoint and now your own app gets 403. What went wrong? Either the firewall is default-deny without the caller’s network allowed, public access is disabled without a working Private Endpoint + Private DNS for the caller, or you didn’t set bypass AzureServices for a first-party caller. The 403 body says ForbiddenByFirewall (network), distinct from AccessDenied (missing role). Fix the network path/DNS or allow-list, don’t touch the role.

11. What’s the difference between referencing a versioned and an unversioned secret URI? An unversioned URI (/secrets/Name/) follows the current version, so rotation flows through without changing the reference. A versioned URI pins an exact version — deterministic and audited, but it won’t pick up rotation. Use unversioned to auto-follow rotation, versioned when you need a fixed, reviewed value.

12. Why is Key Vault Contributor insufficient to let someone read a secret, and what would you assign instead? Contributor is a control-plane role — it manages the vault but grants no data-plane access to the objects inside, by design (separation of duties). To read secret values you assign a data-plane role: Key Vault Secrets User (read) or Secrets Officer (CRUD), at the vault or object scope.

These map to AZ-500 (Security Engineer): manage Key Vault, secrets, keys, certificates, RBAC, network restrictions; and to AZ-204 (Developer Associate): secure app configuration data using Key Vault and managed identities. The networking angle (Private Endpoint, firewall) touches AZ-700, and governance (Policy enforcing vault standards) touches AZ-305. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam objective area |
| --- | --- | --- |
| Control vs data plane, RBAC vs policies | AZ-500 | Manage Key Vault access |
| Managed identity + KV references | AZ-204 / AZ-500 | Secure app config; managed identities |
| Soft-delete, purge protection, recovery | AZ-500 | Configure Key Vault security |
| HSM / Managed HSM, FIPS levels | AZ-500 | Key management & custody |
| Private Endpoint / firewall | AZ-700 / AZ-500 | Secure PaaS connectivity |
| Rotation, certificates, Event Grid | AZ-204 / AZ-500 | Implement secure secret rotation |
| Policy enforcing vault standards | AZ-305 | Design governance |

Quick check

  1. Someone has Key Vault Contributor on a vault but gets 403 reading a secret. Why, and what do you assign instead?
  2. An app’s Key Vault reference resolves to an empty value and the app crash-loops. Name the two things to check, in order.
  3. True or false: a bigger vault SKU is the correct fix for 429 throttling errors under load.
  4. You enabled purge protection last week and now want to disable it. Can you, and why or why not?
  5. A custom-domain TLS certificate stored in Key Vault expired despite being “managed.” What was almost certainly not configured?

Answers

  1. Key Vault Contributor is a control-plane role — it manages the vault (firewall, SKU, policies) but grants no data-plane access to the objects inside, by design. Assign a data-plane role instead: Key Vault Secrets User (read) or Secrets Officer (CRUD), at the vault or object scope.
  2. First, does the app have a managed identity (az webapp identity show — if empty, no token is issued; az webapp identity assign). Second, does that identity have a data-plane role at the vault’s actual scope (az role assignment list --assignee <principal> --scope <vaultId> — if empty, assign Key Vault Secrets User at the vault scope). It’s almost always identity-or-role, not the secret.
  3. False. 429 is throttling against a per-vault transaction cap; a bigger SKU doesn’t raise it. The fix is in-process caching (read the secret once, refresh on an interval and on SecretNewVersionCreated) plus exponential backoff. Managed HSM has higher throughput, but the real fix is to stop reading secrets per request.
  4. No. Purge protection is irreversible by design — once enabled it cannot be turned off for the life of the vault, and the retention period becomes a hard floor. If you genuinely need a no-PP vault you must create a new one (rare; PP is the safer default).
  5. Auto-renewal — an AutoRenew lifetime action on the certificate policy (e.g. renew 30 days before expiry), and ideally a subscription to CertificateNearExpiry/CertificateNewVersionCreated events. A policy set to EmailContacts only warns and doesn’t renew; “stored in Key Vault” is not the same as “set to renew itself.”
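The check behind answer 5 can be automated along these lines. This is a sketch that assumes the JSON printed by `az keyvault certificate show --query policy.lifetimeActions` is a list of objects shaped like `{"action": {"actionType": ...}, "trigger": {...}}`; confirm that shape against your own CLI output before relying on it. The function name is hypothetical.

```python
import json
from typing import Optional


def auto_renew_trigger(lifetime_actions_json: str) -> Optional[dict]:
    """Return the trigger of the first AutoRenew action, or None if nothing renews.

    None covers both an empty policy and one that only has EmailContacts.
    """
    actions = json.loads(lifetime_actions_json) or []
    for entry in actions:
        action_type = (entry.get("action") or {}).get("actionType", "")
        if action_type.lower() == "autorenew":
            return entry.get("trigger") or {}
    return None
```

A `None` result is the "stored in Key Vault but not set to renew itself" case, and is worth failing a pipeline check on for any certificate that fronts production traffic.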

Glossary

Next steps

You can now treat secrets, keys and certificates as governed assets and avoid the four failures that page you. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
