Blue-green deployment promises a release you can roll back in seconds instead of redeploying under pressure. On Azure this is achievable with native primitives — App Service deployment slots for the swap, Azure Front Door for a gradual edge cutover — but the gap between the demo and a safe production pipeline is where teams get burned. This guide builds the full flow: health-gated swaps, weighted traffic shifting, and a one-action rollback, all driven from CI/CD.
1. Blue-green vs canary vs rolling: pick the strategy
These three terms get used interchangeably and they are not the same. The right choice depends on whether your app is stateful and how much blast radius you can tolerate.
| Strategy | How it works | Rollback | Best for |
|---|---|---|---|
| Rolling | Replace instances in batches in place | Roll forward (slow) | Stateless apps where partial-version overlap is fine |
| Canary | Route a small % to the new version, ramp on metrics | Reduce % to zero | High-traffic services with strong telemetry and SLOs |
| Blue-green | Two full environments; cut all traffic at once after validation | Flip back to the old environment | Apps needing a clean version boundary and instant rollback |
Blue-green’s defining property is two complete, parallel environments — only one serves production at a time. That clean boundary is exactly what makes it friendly to stateful apps: there is never a moment when v1 and v2 both own the same in-process session state, because the cutover is atomic.
On App Service, the two environments are the production slot (green, live) and a staging slot (blue, the candidate). The swap is the atomic cutover. Front Door sits in front and lets you turn that binary cutover into a gradual one when you want canary-style risk reduction at the edge — the best of both models.
The most common mistake is treating slot warm-up as optional. A swap without warm-up is a cold start in disguise: production instances start serving while still JIT-compiling and filling caches. Zero-downtime requires the candidate to be warm before traffic moves.
2. Deployment slots deep dive
A staging slot is a full, addressable copy of the app running on the same App Service Plan. You deploy to it, warm it, validate it, then swap. Slots require Standard tier or higher.
az webapp deployment slot create \
--resource-group rg-app-prod \
--name app-orders-prod \
--slot staging
Slot settings: what travels during a swap
This is the single subtlety that breaks more blue-green setups than anything else. By default, app settings and connection strings follow the code, not the slot: they move with the content during a swap. That is correct for values that should promote with the release and catastrophic for environment-specific config (you do not want staging’s database connection string becoming production’s).
Mark environment-specific values as slot settings (a.k.a. “deployment slot setting” or “sticky”) so they stay pinned to the slot and do not travel:
az webapp config appsettings set \
-g rg-app-prod -n app-orders-prod --slot staging \
--slot-settings \
ASPNETCORE_ENVIRONMENT=Staging \
"SqlConnection=@Microsoft.KeyVault(SecretUri=https://kv-orders-prod.vault.azure.net/secrets/sql-conn/)"
| Setting type | Behavior on swap | Use for |
|---|---|---|
| Regular app setting | Travels with the code | Feature flags, tuning that should promote with the release |
| Slot setting (sticky) | Stays pinned to the slot | Environment name, env-specific connection strings, slot-scoped keys |
A useful discipline: connection strings should generally be slot settings, while feature flags and app-version metadata should generally travel. Audit slotSetting: true before every release.
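The audit can be scripted so a release fails when an environment-specific key is not sticky. The following is a minimal sketch, not a definitive check: the function name missing_sticky and the list of required keys are assumptions for illustration, and the az query in the usage comment is the one this guide uses in its Verify section.

```shell
#!/usr/bin/env bash
# Sketch: report environment-specific keys that are NOT marked as slot settings.
# The function name and the key list in the usage example are hypothetical.

# missing_sticky REQUIRED ACTUAL
#   REQUIRED: whitespace-separated key names that must be sticky
#   ACTUAL:   whitespace/newline-separated names reported as slot settings
# Prints each required key absent from ACTUAL; returns 1 if any is missing.
missing_sticky() {
  local required="$1" actual="$2" key name found status=0
  for key in $required; do
    found=0
    for name in $actual; do
      if [ "$name" = "$key" ]; then
        found=1
        break
      fi
    done
    if [ "$found" -eq 0 ]; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   actual=$(az webapp config appsettings list -g rg-app-prod -n app-orders-prod \
#     --slot staging --query "[?slotSetting].name" -o tsv)
#   missing_sticky "ASPNETCORE_ENVIRONMENT SqlConnection" "$actual" || exit 1
```

Keeping the comparison in a pure function means the gate can be exercised in CI without touching Azure.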
Warm-up so the swap is genuinely zero-downtime
App Service can ping a path on every candidate instance and wait for healthy responses before completing the swap. This is the mechanism that turns a swap from a cold start into a true zero-downtime cutover.
az webapp config appsettings set \
-g rg-app-prod -n app-orders-prod --slot staging \
--slot-settings \
WEBSITE_SWAP_WARMUP_PING_PATH=/health/ready \
WEBSITE_SWAP_WARMUP_PING_STATUSES="200,202"
# Keep the slot from idling out before a swap
az webapp config set -g rg-app-prod -n app-orders-prod --slot staging --always-on true
WEBSITE_SWAP_WARMUP_PING_PATH and WEBSITE_SWAP_WARMUP_PING_STATUSES gate the swap on your readiness endpoint returning an acceptable status on each instance. The endpoint must check real dependencies — database reachable, Key Vault references resolved, cache primed — not return 200 unconditionally. A trivial health check defeats the entire purpose of warm-up gating.
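The real readiness endpoint lives inside your application, but the decision it makes is simple enough to sketch in shell: answer 200 only when every dependency check passes, otherwise 503. The function name readiness_code and the check names in the usage comment are hypothetical; each check is assumed to be a command or function that exits 0 on success.

```shell
#!/usr/bin/env bash
# Sketch of the aggregation a readiness endpoint should perform.
# readiness_code CHECK [CHECK...]
#   Each CHECK is a command or function name (hypothetical) that exits 0 when
#   its dependency is usable. Prints 200 if all pass, 503 otherwise.
readiness_code() {
  local check
  # With no checks configured, refuse to report ready: an unconditional 200
  # is exactly the failure mode warm-up gating is meant to prevent.
  if [ "$#" -eq 0 ]; then
    echo 503
    return 1
  fi
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo 503
      return 1
    fi
  done
  echo 200
}

# Usage (check functions are placeholders you would implement):
#   readiness_code check_database check_keyvault check_cache
```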
3. The database and stateful-dependency problem
Blue-green’s atomic cutover does not exempt you from the hardest part: both slots talk to the same backing data. The staging slot is not a parallel database; it is a parallel application pointed at the same SQL, the same cache, the same queues. That has consequences.
Schema changes must be backward compatible across the swap window. During preview and immediately after a swap, both old and new code can hit the database simultaneously. The rule is expand/contract (a.k.a. parallel-change):
- Expand: deploy a schema change that is additive only — new nullable columns, new tables, new optional parameters. Old code ignores them; new code uses them.
- Migrate + swap: ship the new code that reads/writes the new shape. Both versions coexist safely because the old columns still exist.
- Contract: in a later release, once nothing runs the old code, drop the deprecated columns.
Never combine a destructive migration (drop column, rename, tighten a constraint) with the same release that depends on it. If you have to roll back, the old code will hit a schema it no longer understands and your “instant rollback” becomes an outage. Destructive changes are always a separate, later release.
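One way to enforce this rule is a lint step that rejects destructive statements in the migration shipped with a code release. This is a rough sketch under stated assumptions: the function name is hypothetical and the pattern list is illustrative rather than a SQL parser, so treat a pass as "nothing obvious found", not as proof of compatibility.

```shell
#!/usr/bin/env bash
# Sketch: detect destructive statements in a SQL migration read from stdin.
# Returns 0 when a destructive pattern is found, 1 otherwise.
# The patterns are an illustrative assumption, not an exhaustive list.
is_destructive_migration() {
  grep -Eiq 'drop[[:space:]]+(column|table)|rename[[:space:]]+(column|to)|sp_rename|alter[[:space:]]+column[^;]*not[[:space:]]+null'
}

# Usage (the migration path is a placeholder):
#   if is_destructive_migration < migrations/next.sql; then
#     echo "destructive change: ship it in a separate, later release" >&2
#     exit 1
#   fi
```

An additive change such as adding a nullable column passes, while a dropped column, a rename, or a tightened NOT NULL constraint is flagged.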
For other stateful dependencies:
- In-process session state is the reason to prefer blue-green here: the atomic swap means no request is ever served by a mix of versions. But sessions held in memory are lost on swap — externalize session to Redis or a distributed cache so a cutover does not log everyone out.
- Background workers / queue consumers keep running on the old code until the swap completes, then the new code picks up the same queue. Make message handlers tolerant of being processed by either version during the overlap (idempotent, schema-version-aware).
- Outbound caches and connection pools are cold on the candidate. The warm-up path should prime them so the first real users do not pay the warm-up tax.
4. Health-gated auto-swap with swap-with-preview
The robust production pattern is swap with preview (a two-phase swap). Phase 1 applies the target (production) configuration to the staging slot and restarts it under production config — without moving any traffic. You validate the slot now running production config, then complete the swap.
# Phase 1: apply production config to staging, no traffic moved yet
az webapp deployment slot swap \
-g rg-app-prod -n app-orders-prod \
--slot staging --target-slot production --action preview
# ... run smoke tests against the staging slot, now running prod config ...
# Phase 2: complete the swap (traffic moves atomically)
az webapp deployment slot swap \
-g rg-app-prod -n app-orders-prod \
--slot staging --target-slot production --action swap
If smoke tests fail during preview, abort with --action reset and nothing reaches users:
az webapp deployment slot swap \
-g rg-app-prod -n app-orders-prod \
--slot staging --action reset
The warm-up ping configured in Step 2 runs automatically as part of the swap operation — App Service will not complete the swap until the warm-up statuses pass. So you get two gates: your explicit smoke tests during preview, and the platform’s warm-up gate during completion.
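The preview, validate, then complete-or-reset sequence can be wrapped in one function so a failed smoke test always resets the preview. This is a sketch, not a hardened script: the AZ_BIN override and the smoke-test command are assumptions added for illustration, while the az arguments are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch: two-phase swap that resets automatically when smoke tests fail.
# AZ_BIN is a hypothetical override so the flow can be dry-run with a stub.
AZ_BIN="${AZ_BIN:-az}"

# swap_with_preview SMOKE_CMD [ARGS...]
#   Applies production config to staging (preview), runs SMOKE_CMD, then
#   completes the swap on success or resets the preview on failure.
swap_with_preview() {
  local rg="rg-app-prod" app="app-orders-prod" slot="staging"

  "$AZ_BIN" webapp deployment slot swap -g "$rg" -n "$app" \
    --slot "$slot" --target-slot production --action preview || return 1

  if "$@"; then
    "$AZ_BIN" webapp deployment slot swap -g "$rg" -n "$app" \
      --slot "$slot" --target-slot production --action swap
  else
    echo "smoke tests failed; resetting preview" >&2
    "$AZ_BIN" webapp deployment slot swap -g "$rg" -n "$app" \
      --slot "$slot" --action reset
    return 1
  fi
}

# Usage (run_smoke_tests.sh is a placeholder for your own test command):
#   swap_with_preview ./run_smoke_tests.sh
```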
Distinguish the two health paths and do not conflate them:
- Liveness (/health/live): is the process up? Used by App Service Health Check to recycle dead instances.
- Readiness (/health/ready): are dependencies good? Used by the swap warm-up gate.
Wire Health Check on the liveness path so the platform pulls unhealthy instances out of rotation independently of deploys:
az webapp config set -g rg-app-prod -n app-orders-prod \
--generic-configurations '{"healthCheckPath": "/health/live"}'
5. Front Door weighted routing for a gradual edge cutover
A slot swap is binary: 0% then 100%. For high-traffic services you often want to ramp — send 10% to the new version, watch error rates, then ramp to 100%. Azure Front Door Standard/Premium does this with weighted origins in an origin group.
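The pattern: register both slots as origins in one origin group. The production slot starts at weight 100 and the staging slot at weight 1, the minimum Front Door accepts (weights range from 1 to 1000). At 100/1 the staging origin still receives roughly 1% of requests, so if it must take no traffic at all before validation, keep the origin disabled (--enabled-state Disabled) until you are ready to ramp. After the candidate is validated, you shift weights to ramp traffic, then either complete the cutover or pull it back.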
Front Door origin weights are relative, not percentages. Weights of 90 and 10 send roughly 90% and 10% of traffic. Latency-based routing can still influence selection within a priority tier, so for deterministic canary splits keep both origins at the same priority and rely on weight.
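Because weights are relative, the expected share of an origin is its weight divided by the sum of all weights in the group. The helper below is a small sketch of that arithmetic (the function name share_pct is hypothetical); it assumes all origins share one priority and fall inside the latency window, so the real split should still be confirmed in metrics.

```shell
#!/usr/bin/env bash
# Sketch: expected traffic share for one origin given relative weights.
# share_pct WEIGHT OTHER_WEIGHT...
#   Prints WEIGHT as a percentage of the total, truncated to two decimals.
share_pct() {
  local mine="$1" total=0 w
  for w in "$@"; do
    total=$(( total + w ))
  done
  local bp=$(( mine * 10000 / total ))   # basis points, integer math only
  printf '%d.%02d\n' $(( bp / 100 )) $(( bp % 100 ))
}

share_pct 10 90    # prints 10.00
share_pct 1 100    # prints 0.99
```

The second call shows why a weight of 1 next to 100 is a small share rather than zero.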
Add the staging slot as a second origin (the production slot is assumed already registered):
az afd origin create \
--resource-group rg-app-prod \
--profile-name afd-orders \
--origin-group-name og-orders \
--origin-name origin-staging \
--host-name app-orders-prod-staging.azurewebsites.net \
--origin-host-header app-orders-prod-staging.azurewebsites.net \
--priority 1 \
--weight 1 \
--enabled-state Enabled \
--https-port 443
Ramp traffic by updating weights. Start small:
# 10% to the new version (relative weights 90 / 10)
az afd origin update -g rg-app-prod --profile-name afd-orders \
--origin-group-name og-orders --origin-name origin-production --weight 90
az afd origin update -g rg-app-prod --profile-name afd-orders \
--origin-group-name og-orders --origin-name origin-staging --weight 10
Configure health probes on the origin group so Front Door stops routing to an origin that starts failing — this is your automatic safety net during the ramp:
az afd origin-group update \
-g rg-app-prod --profile-name afd-orders --origin-group-name og-orders \
--probe-path /health/live --probe-protocol Https \
--probe-request-type GET --probe-interval-in-seconds 30 \
--sample-size 4 --successful-samples-required 3
Traffic Manager (DNS-based, with its own weighted routing method) is an alternative when you need cross-region failover or non-HTTP endpoints. But because it works at DNS, cutover and rollback are gated by client DNS TTL caching. Front Door reweights at the edge, so a change applies as soon as the configuration propagates, with no dependence on client resolvers, which is what you want for canary control. Allow for propagation to every edge location taking minutes rather than seconds, and measure it in a rehearsal. Use Front Door for HTTP apps; reach for Traffic Manager only for the cross-region or protocol cases it uniquely covers.
There are now two complementary cutover mechanisms: the slot swap (atomic, instance-level, the source of truth for “what is production”) and Front Door weights (gradual, edge-level, for risk-managed ramp). A mature flow uses Front Door weights to validate under real traffic, then performs the slot swap to make the new version the true production slot, then returns all traffic to the (now-swapped) production origin by taking the staging origin back out of rotation. Front Door rejects a weight of 0 (valid weights are 1 to 1000), so that last step means disabling the staging origin or dropping it to the minimum weight.
6. Automating the full flow in a pipeline
Here is the end-to-end flow as an Azure DevOps multi-stage pipeline using OIDC (workload identity federation via a service connection), so there are no long-lived secrets. The shape maps directly onto GitHub Actions environments if that is your platform.
# azure-pipelines.yml
trigger:
  branches: { include: [main] }
variables:
  rg: rg-app-prod
  app: app-orders-prod
  slot: staging
stages:
- stage: Build
  jobs:
  - job: build
    pool: { vmImage: ubuntu-latest }
    steps:
    - script: |
        dotnet publish -c Release -o $(Build.ArtifactStagingDirectory)/app
      displayName: Build
    - publish: $(Build.ArtifactStagingDirectory)/app
      artifact: app
- stage: DeployStaging
  dependsOn: Build
  jobs:
  - deployment: deploy_blue
    environment: prod-staging-slot
    pool: { vmImage: ubuntu-latest }
    strategy:
      runOnce:
        deploy:
          steps:
          - download: current
            artifact: app
          - task: AzureWebApp@1
            inputs:
              azureSubscription: sc-prod-oidc # OIDC service connection
              appType: webApp # required input; use webAppLinux on a Linux plan
              appName: $(app)
              deployToSlotOrASE: true
              resourceGroupName: $(rg)
              slotName: $(slot)
              package: $(Pipeline.Workspace)/app
- stage: Verify
  dependsOn: DeployStaging
  jobs:
  - job: smoke
    pool: { vmImage: ubuntu-latest }
    steps:
    - task: AzureCLI@2
      inputs:
        azureSubscription: sc-prod-oidc
        scriptType: bash
        scriptLocation: inlineScript
        inlineScript: |
          set -euo pipefail
          HOST="https://${APP}-${SLOT}.azurewebsites.net"
          # Readiness must pass on the candidate before we consider swapping
          for i in $(seq 1 10); do
            # "|| true" stops set -e from aborting the retry loop on a connection error
            code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$HOST/health/ready" || true)
            [ "$code" = "200" ] && echo "ready" && exit 0
            echo "attempt $i -> $code"; sleep 15
          done
          echo "candidate never became ready"; exit 1
      env:
        APP: $(app)
        SLOT: $(slot)
- stage: Swap
  dependsOn: Verify
  jobs:
  - deployment: swap_to_green
    environment: prod # attach a manual approval check on this environment
    pool: { vmImage: ubuntu-latest }
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: sc-prod-oidc
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                az webapp deployment slot swap \
                  -g "$RG" -n "$APP" --slot "$SLOT" --target-slot production
            env:
              RG: $(rg)
              APP: $(app)
              SLOT: $(slot)
The approval gate lives on the prod environment (Azure DevOps environment checks, or a GitHub Actions environment with required reviewers). The pipeline deploys to blue, runs automated verification, pauses for human approval, then performs the swap. The warm-up ping gates the swap itself at the platform level, so even an approved swap will not complete against unhealthy instances.
If you want the gradual Front Door ramp inside the pipeline, insert a stage between Verify and Swap that shifts weights to 90/10 (production/staging), runs a timed observation window querying Front Door metrics or Application Insights failure rate, and only proceeds on a clean window.
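The "only proceeds on a clean window" decision can be reduced to a small gate function. The sketch below assumes you have already collected failed and total request counts for the window; how you query them is left out, and both the function name ramp_gate and the 1% threshold in the usage comment are assumptions rather than recommendations.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a ramp may proceed after an observation window.
# ramp_gate FAILED TOTAL MAX_FAILURE_PCT
#   Exits 0 to proceed, 1 to roll back. Integer math only.
ramp_gate() {
  local failed="$1" total="$2" max_pct="$3"
  if [ "$total" -eq 0 ]; then
    # No traffic means no evidence; do not treat silence as success.
    echo "no traffic observed; refusing to proceed" >&2
    return 1
  fi
  # failed/total > max_pct/100, rearranged to avoid division.
  if [ $(( failed * 100 )) -gt $(( total * max_pct )) ]; then
    echo "failure rate above ${max_pct}%; roll back" >&2
    return 1
  fi
  echo "clean window; proceed"
}

# Usage (counts would come from your metrics query):
#   ramp_gate "$failed_requests" "$total_requests" 1 || exit 1
```

Treating an empty window as a failure avoids promoting a candidate that never actually received canary traffic.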
7. Instant rollback patterns
The whole point of blue-green is that rollback is fast and boring. There are three rollback levers, and which one you reach for depends on when the regression surfaces.
During preview (before completion): abort. Nothing reached users.
az webapp deployment slot swap -g rg-app-prod -n app-orders-prod --slot staging --action reset
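During a Front Door ramp (partial traffic): take the candidate origin out of rotation. Front Door origin weights must be between 1 and 1000, so a weight of 0 is rejected; restore production’s weight and disable the staging origin instead. No swap or redeploy is involved, and the change applies as the configuration propagates across the edge.
az afd origin update -g rg-app-prod --profile-name afd-orders \
--origin-group-name og-orders --origin-name origin-production --weight 100
az afd origin update -g rg-app-prod --profile-name afd-orders \
--origin-group-name og-orders --origin-name origin-staging --enabled-state Disabled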
After a completed swap (100% on new version): swap back. The previous production bits are sitting in the staging slot, so rollback is another swap — not a redeploy.
az webapp deployment slot swap -g rg-app-prod -n app-orders-prod --slot staging --target-slot production
What to test first on rollback: confirm the data layer is compatible with the version you are rolling back to. This is why the expand/contract discipline in Step 3 is non-negotiable — if the failed release ran a destructive migration, a swap-back returns the old code to a schema it cannot read, and you have traded a bad deploy for a hard outage. Verify schema compatibility before you trust swap-back as your rollback.
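A runbook helper that maps the release phase to its lever keeps the choice mechanical under pressure. This is a sketch: the phase labels are hypothetical, and the function prints the command for review instead of running it. For the ramp phase it disables the candidate origin, on the assumption that Front Door weights must stay between 1 and 1000 and so cannot be set to 0.

```shell
#!/usr/bin/env bash
# Sketch: print the rollback command for a given release phase.
# Phase names (preview, ramp, swapped) are labels invented for this example.
rollback_command() {
  local swap="az webapp deployment slot swap -g rg-app-prod -n app-orders-prod --slot staging"
  local afd="az afd origin update -g rg-app-prod --profile-name afd-orders --origin-group-name og-orders"
  case "$1" in
    preview) echo "$swap --action reset" ;;                                   # abort, nothing reached users
    ramp)    echo "$afd --origin-name origin-staging --enabled-state Disabled" ;;  # pull candidate from rotation
    swapped) echo "$swap --target-slot production" ;;                          # swap the previous build back
    *)       echo "unknown phase: $1" >&2; return 2 ;;
  esac
}

# Usage:
#   rollback_command ramp      # review the output, then run it
```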
Enterprise scenario
A payments team ran their orders API behind App Service slots with Front Door weighted ramp, and it worked flawlessly in staging. The first production ramp to 10% triggered a flood of duplicate-charge alerts within ninety seconds. The cause was not the deploy mechanics — it was sticky sessions. They had session affinity enabled on the Front Door origin group, so returning users were pinned to the production origin while new sessions scattered across both. A user who began checkout on the old origin and got reweighted mid-flow hit the new code’s idempotency logic, which keyed off a header the old version never set. Two versions, one payment, no shared idempotency key.
The fix had two parts. First, disable affinity for the canary window so the split is honest and every request is independently routable:
az afd origin-group update -g rg-app-prod \
--profile-name afd-orders --origin-group-name og-orders \
--session-affinity-state Disabled
Second — the real lesson — the idempotency key had to be derived from request content, not a server-set header, so it stayed stable across both versions during the overlap. They moved to a client-supplied Idempotency-Key validated server-side, deployed it as a backward-compatible expand release one sprint ahead of the ramp, and only then resumed weighted cutovers.
The principle: blue-green and canary make two versions serve real users simultaneously. Any state that must be consistent across that boundary — idempotency keys, session tokens, cache key shapes — has to be version-agnostic before you split traffic, not after. Affinity hides the problem in test and detonates it in production.
Verify
# Slot settings are sticky (env-specific keys must show slotSetting: true)
az webapp config appsettings list -g rg-app-prod -n app-orders-prod --slot staging \
--query "[?slotSetting].name" -o tsv
# Candidate readiness passes on the staging slot before any swap
curl -s -o /dev/null -w "%{http_code}\n" \
https://app-orders-prod-staging.azurewebsites.net/health/ready # expect 200
# Confirm which version each slot currently serves (expose build SHA at an endpoint)
curl -s https://app-orders-prod.azurewebsites.net/version
curl -s https://app-orders-prod-staging.azurewebsites.net/version
# Front Door origin weights are where you expect during/after a ramp
az afd origin list -g rg-app-prod --profile-name afd-orders \
--origin-group-name og-orders --query "[].{name:name,weight:weight,priority:priority}" -o table
# Front Door is routing to a healthy origin end to end
curl -s -o /dev/null -w "%{http_code}\n" https://<your-frontdoor-endpoint>/health/live
A swap is correct when production serves the new build SHA, the previous SHA is now in staging (ready for swap-back), and Front Door reports both origins healthy with the expected weights.
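The SHA part of that definition is easy to assert automatically. The sketch below assumes the /version endpoint returns a bare build SHA (the guide only says to expose one, so the response shape is an assumption) and uses a hypothetical function name; it takes plain strings so it can be tested without network access.

```shell
#!/usr/bin/env bash
# Sketch: assert the post-swap state from build SHAs.
# verify_swap NEW_SHA PREV_SHA PROD_SHA STAGING_SHA
verify_swap() {
  local new="$1" prev="$2" prod="$3" staging="$4"
  if [ "$prod" != "$new" ]; then
    echo "production serves $prod, expected $new" >&2
    return 1
  fi
  if [ "$staging" != "$prev" ]; then
    # Swap-back only restores the previous build if staging still holds it.
    echo "staging holds $staging, expected $prev" >&2
    return 1
  fi
  echo "swap verified"
}

# Usage (NEW_SHA and PREV_SHA come from your pipeline):
#   prod=$(curl -s https://app-orders-prod.azurewebsites.net/version)
#   staging=$(curl -s https://app-orders-prod-staging.azurewebsites.net/version)
#   verify_swap "$NEW_SHA" "$PREV_SHA" "$prod" "$staging" || exit 1
```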
Release checklist
Pitfalls nobody documents
- Plan capacity during a swap. Both slots share the App Service Plan. A warm staging slot consumes the same instance pool, so size autoscale max-count with headroom or a deploy can starve production of capacity. Keep min-count at 2+ so production survives an instance recycle mid-swap.
- Cost of running two environments. Slots themselves are free, but the warm staging instances are not; they bill against the shared plan. The honest cost of blue-green is the headroom you keep for the candidate, plus Front Door’s request/data charges if you front it.
- Connection draining is not instant. A swap moves the hostname mapping, but in-flight requests on the old instances need to finish. Keep requests short and idempotent; long-running synchronous requests can be cut off at the cutover boundary.
- Forgetting diagnostics on the staging slot. A slot is a distinct resource and does not inherit diagnostic settings or App Insights wiring. Configure them on the slot too, or you go blind exactly when validating a candidate.
- Trusting weights as percentages. Front Door weights are relative and interact with priority and latency routing. For a clean canary split, keep both origins at the same priority and verify the actual split in metrics rather than assuming.
Build the swap-with-preview flow first, make warm-up gate on a readiness check that means something, and rehearse all three rollback levers before you need them. Done that way, a bad release is a non-event: you swap back in seconds and debug at leisure, instead of redeploying into a live incident.