Designing Multi-Stage Azure DevOps YAML Pipelines with Environments, Approvals, and Deployment Gates

Classic UI-based release pipelines were easy to click together and impossible to review. Multi-stage YAML pipelines put the entire promotion path — build, dev, test, prod, approvals, and gates — under version control next to the code it ships. This guide builds a defensible, production-grade pipeline that takes a single commit from dev to prod with real guardrails.

The multi-stage mental model

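A YAML pipeline is a hierarchy: a pipeline contains stages, a stage contains jobs, and a job contains steps. There are two kinds of jobs: regular jobs (job:), which simply run a sequence of steps, and deployment jobs (deployment:), which target an Environment, record deployment history against it, and run their steps through a strategy such as runOnce.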

The key mental shift from classic releases: approvals and gates are not pipeline YAML. They are checks attached to a protected resource — almost always an Environment, sometimes a service connection or variable group. The pipeline declares what it wants to deploy and where; the Environment’s checks decide whether it is allowed to proceed. This separation is deliberate: a developer editing the pipeline file cannot remove a production approval, because that approval lives on the Environment, governed by a different permission.

Checks run before the deployment job’s agent is acquired. A blocked approval costs you nothing in agent minutes.

Step 1 — Structure stages and the dependsOn graph

Stages run sequentially by default, each depending on the previous one. Make dependencies explicit so the graph is obvious and so you can fan out later. A deployment job names its target Environment under environment:.

trigger:
  branches:
    include: [ main ]

pool:
  vmImage: ubuntu-latest

stages:
  - stage: Build
    jobs:
      - job: build
        steps:
          - task: DotNetCoreCLI@2
            inputs:
              command: publish
              publishWebProjects: true
              arguments: '--configuration Release --output $(Build.ArtifactStagingDirectory)'
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: app

  - stage: DeployDev
    dependsOn: Build
    jobs:
      - deployment: deploy
        environment: dev
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: app
                - script: echo "Deploying to dev"

  - stage: DeployTest
    dependsOn: DeployDev
    jobs:
      - deployment: deploy
        environment: test
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: app
                - script: echo "Deploying to test"

  - stage: DeployProd
    dependsOn: DeployTest
    jobs:
      - deployment: deploy
        environment: prod
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: app
                - script: echo "Deploying to prod"

Two details worth internalizing:

To gate a stage on a non-default branch or a condition, combine dependsOn with condition:

  - stage: DeployProd
    dependsOn: DeployTest
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
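
Explicit dependencies are also what make fan-out and fan-in possible. The following is a minimal sketch, not part of the pipeline above: it assumes two hypothetical regional stages (DeployProdEU, DeployProdUS), hypothetical Environments prod-eu and prod-us, and a hypothetical Verify stage. Both regional stages depend only on DeployTest, so they can run in parallel, and the final stage lists both in dependsOn so it waits for each.

```yaml
  # Fan out: both stages depend only on DeployTest, so they may run in parallel.
  - stage: DeployProdEU              # hypothetical stage name
    dependsOn: DeployTest
    jobs:
      - deployment: deploy
        environment: prod-eu         # hypothetical Environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to EU"

  - stage: DeployProdUS              # hypothetical stage name
    dependsOn: DeployTest
    jobs:
      - deployment: deploy
        environment: prod-us         # hypothetical Environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to US"

  # Fan in: a list under dependsOn waits for every listed stage.
  - stage: Verify                    # hypothetical stage name
    dependsOn:
      - DeployProdEU
      - DeployProdUS
    jobs:
      - job: verify
        steps:
          - script: echo "All regions deployed"
```

Setting dependsOn: [] on a stage removes the implicit dependency on the previous stage, so that stage starts as soon as the run does.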

Step 2 — Model dev/test/prod as Environments

Create one Environment per deployment target under Pipelines -> Environments. Environments give you three things classic releases bolted on awkwardly: deployment history per target, resource scoping (you can add specific Kubernetes namespaces or VMs as resources), and a place to hang checks.

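You can create them in the portal or script it. The azure-devops CLI extension does not appear to ship a dedicated environment command, so this sketch calls the Environments REST API through az devops invoke (the api-version value is an assumption; use one your organization supports):

echo '{"name": "prod", "description": "Production"}' > env.json
az devops invoke \
  --area distributedtask --resource environments \
  --route-parameters project=Payments \
  --organization https://dev.azure.com/contoso \
  --http-method POST --in-file env.json \
  --api-version 6.0-preview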

Environment names referenced in YAML are created on first run if they do not exist — but an auto-created Environment has no checks. Always pre-create production Environments and attach checks before the pipeline can target them, or your first prod run sails straight through.

For container or VM workloads, target a resource within the Environment using the environment.resourceName form:

      - deployment: deploy
        environment: prod.payments-ns   # Kubernetes resource "payments-ns" in env "prod"
        strategy:
          runOnce:
            deploy:
              steps:
                - task: KubernetesManifest@1
                  inputs:
                    action: deploy
                    manifests: manifests/deployment.yaml

Step 3 — Manual approvals, business-hours, and exclusive locks

Checks are configured on the Environment: open it under Pipelines -> Environments and choose Approvals and checks. The three you will reach for most:

Approvals. Designate approver users or groups. Use a group, not individuals, so on-call rotation does not break promotions. Set a timeout (the run waits, pending, until then) and decide whether the person who requested the run may also approve it; for prod, turn that off to enforce four-eyes.

Business hours. A check that only passes during a defined window and time zone. Attach it to prod so a Friday-evening merge queues until Monday morning rather than deploying into the weekend.

Exclusive lock. Guarantees only one run at a time can pass through the Environment. Without it, two merges in quick succession can both enter the prod stage and race. With it, runs serialize; the newer run waits for the lock. Pair this with the pipeline-level setting to cancel superseded runs if you only ever care about the latest commit reaching prod.
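
How waiting runs behave while the lock is held is something the pipeline itself can declare. The sketch below assumes the lockBehavior property of the current Azure DevOps Services YAML schema: sequential lets every queued run take the lock in turn, while runLatest lets only the most recent run take it and cancels older runs that are still waiting. The stage content is illustrative only.

```yaml
# Pipeline-level setting; it can also be set on an individual stage.
lockBehavior: sequential     # or runLatest, if only the newest commit should reach prod

stages:
  - stage: DeployProd
    jobs:
      - deployment: deploy
        environment: prod    # the Exclusive lock check itself is configured on this Environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to prod"
```

The property only changes how lock requests are ordered; without an Exclusive lock check on the Environment it has nothing to act on.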

Checks have a timeout and a separate evaluation retry cadence. An approval that no one actions within its timeout fails the stage; it does not silently pass. Set the timeout to match your real escalation SLA.

You cannot define these checks in pipeline YAML; that is the point of putting them on the resource. What you can version-control is the Environment-and-checks configuration itself, by managing Azure DevOps with Terraform (the azuredevops provider exposes azuredevops_environment and check resources), so even your approval policy is reviewable.

Step 4 — Automated deployment gates

A gate is an automated check that polls an external system and only passes when a condition holds. This is how you stop a promotion when production is already unhealthy. Two built-in checks cover most needs:

Query Azure Monitor alerts. Configure the check with a subscription, resource group, and the alert rules to evaluate. The check passes only when no configured alert is firing. Attach it to prod so an active “5xx spike” or “p99 latency” alert blocks the next deployment automatically.

Invoke REST API. Call any HTTPS endpoint and pass/fail based on the response. Use it to query a change-freeze calendar, a feature-flag service, or your own health endpoint. The check succeeds when the response matches your success criteria; configure it to retry on a cadence so a transient failure does not immediately fail the stage.

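A robust gate pattern: give the check a timeout and an evaluation interval (for example, re-evaluate every 5 minutes for up to 30 minutes). The stage proceeds once an evaluation succeeds and every other check on the Environment has passed, and it fails if that has not happened by the timeout. Because a single passing evaluation is enough, make the endpoint or alert rule itself report on a sustained window if you need to keep a flapping service out.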

You can also write your own gate purely in YAML by making an early job fail fast on a probe, but prefer the resource-level checks: they run before agent acquisition and apply no matter which pipeline targets the Environment.

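      - deployment: deploy
        environment: prod
        strategy:
          runOnce:
            preDeploy:
              steps:
                - checkout: self                     # deployment jobs do not check out the repo by default
                - script: ./scripts/smoke-check.sh   # in-pipeline guard, complements gates
            deploy:
              steps:
                - checkout: self                     # needed for ./scripts/deploy.sh
                - download: current
                  artifact: app
                - script: ./scripts/deploy.sh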
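
The contents of smoke-check.sh are not shown in this guide. As a rough idea of what such an in-pipeline guard might do, here is a sketch of an inline probe step; the healthUrl variable is hypothetical and would need to be defined on the pipeline or in a variable group.

```yaml
steps:
  - script: |
      # --fail makes curl exit non-zero on an HTTP error status, which fails the step.
      # --retry re-attempts transient failures before giving up.
      curl --fail --silent --show-error \
           --retry 3 --retry-delay 10 --max-time 10 \
           "$(healthUrl)"
    displayName: Probe health endpoint before deploying
```

Unlike a resource-level check, this step consumes agent time and protects only the pipelines that include it.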

Step 5 — Template libraries and required-template enforcement

Copy-pasting stages across repos is how pipelines rot. Azure DevOps has two template mechanisms:

Includes templates inject reusable YAML (steps, jobs, or stages) into a pipeline the author controls. Good for sharing a build sequence.

Extends templates invert control: the pipeline extends a template that owns the overall shape, and the template decides which parameterized hooks the consumer may fill. This is the security-relevant one. Combined with a required template check on a protected resource, you can mandate that any pipeline touching prod must extend an approved governance template — there is no way to deploy to that Environment otherwise.

A consumer pipeline:

# azure-pipelines.yml in an app repo
resources:
  repositories:
    - repository: templates
      type: git
      name: Platform/pipeline-templates
      ref: refs/tags/v3

extends:
  template: stages/deploy.yml@templates
  parameters:
    serviceName: payments-api
    environments: [ dev, test, prod ]

The governing template:

# stages/deploy.yml in Platform/pipeline-templates
parameters:
  - name: serviceName
    type: string
  - name: environments
    type: object
    default: [ dev ]

stages:
  - ${{ each env in parameters.environments }}:
      - stage: Deploy_${{ env }}
        jobs:
          - deployment: deploy
            environment: ${{ env }}
            strategy:
              runOnce:
                deploy:
                  steps:
                    - download: current
                      artifact: app
                    - script: ./deploy.sh ${{ parameters.serviceName }} ${{ env }}

The ${{ each ... }} is a compile-time expansion. Template expressions resolve when the YAML is parsed, before runtime — so the loop literally generates one stage per environment in the expanded pipeline. Pin the template repository to a tag (ref: refs/tags/v3), not a branch, so a platform change cannot silently alter every consumer’s prod path.

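To enforce it: on the prod Environment, add a Required template check that identifies the approved template by repository, ref, and path (here Platform/pipeline-templates, refs/tags/v3, stages/deploy.yml). A run whose pipeline does not extend that template fails the check, so its prod stage never starts.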
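
Going back to the first mechanism, an includes template is simpler: the consumer keeps control of the pipeline and pulls in a reusable fragment. The sketch below assumes a hypothetical steps/build.yml file in the same templates repository and a hypothetical buildConfiguration parameter; the two YAML documents are separated by --- only to show both files in one listing.

```yaml
# steps/build.yml in Platform/pipeline-templates (hypothetical file)
parameters:
  - name: buildConfiguration       # hypothetical parameter
    type: string
    default: Release

steps:
  - task: DotNetCoreCLI@2
    inputs:
      command: publish
      publishWebProjects: true
      arguments: '--configuration ${{ parameters.buildConfiguration }} --output $(Build.ArtifactStagingDirectory)'
  - publish: $(Build.ArtifactStagingDirectory)
    artifact: app
---
# Steps of a build job in a consumer pipeline that declares the "templates" repository resource
steps:
  - template: steps/build.yml@templates
    parameters:
      buildConfiguration: Release
```

Nothing forces a consumer to include this file, which is why includes suit convenience sharing and extends suits governance.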

Step 6 — Variables, parameters, and keyless deploys

Variable groups hold shared, often-secret values (linked to Key Vault for secrets). Reference them at the stage or job scope so dev values never leak into prod:

  - stage: DeployProd
    variables:
      - group: payments-prod   # variable group, may be Key Vault-backed
    jobs:
      - deployment: deploy
        environment: prod
        # ...

A variable group can itself be a protected resource with its own approval check, so granting a pipeline access to prod secrets requires a sign-off independent of the Environment.

Runtime parameters (parameters: at the top of the pipeline) are typed and shown in the Run pipeline dialog. Unlike variables, they expand at compile time, so they can drive ${{ if }} / ${{ each }} logic — ideal for an optional “deploy hotfix to prod only” toggle.
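
A sketch of such a toggle follows, assuming a hypothetical boolean parameter named hotfixOnly and placeholder echo steps. Because the ${{ if }} expressions are resolved at compile time, the dev and test stages are absent from the expanded pipeline when the box is ticked, and DeployProd is rewired to depend on Build directly.

```yaml
parameters:
  - name: hotfixOnly                 # hypothetical parameter name
    displayName: Deploy hotfix to prod only
    type: boolean
    default: false

stages:
  - stage: Build
    jobs:
      - job: build
        steps:
          - script: echo "Build and publish the artifact"

  # These stages exist in the expanded pipeline only when the toggle is off.
  - ${{ if eq(parameters.hotfixOnly, false) }}:
      - stage: DeployDev
        dependsOn: Build
        jobs:
          - deployment: deploy
            environment: dev
            strategy:
              runOnce:
                deploy:
                  steps:
                    - script: echo "Deploying to dev"
      - stage: DeployTest
        dependsOn: DeployDev
        jobs:
          - deployment: deploy
            environment: test
            strategy:
              runOnce:
                deploy:
                  steps:
                    - script: echo "Deploying to test"

  - stage: DeployProd
    # Pick the dependency that actually exists in the expanded pipeline.
    ${{ if eq(parameters.hotfixOnly, true) }}:
      dependsOn: Build
    ${{ if eq(parameters.hotfixOnly, false) }}:
      dependsOn: DeployTest
    jobs:
      - deployment: deploy
        environment: prod
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to prod"
```

The toggle skips stages, not guardrails: the checks on the prod Environment still apply to the hotfix run.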

Keyless deploys with workload identity federation. Do not store cloud secrets in service connections. Create the Azure Resource Manager service connection using Workload Identity Federation: Azure DevOps presents a short-lived OIDC token to Microsoft Entra ID, which exchanges it for an access token via a federated credential on an app registration or managed identity. No client secret is stored, nothing expires under you, and tasks like AzureCLI@2 authenticate transparently:

            deploy:
              steps:
                - task: AzureCLI@2
                  inputs:
                    azureSubscription: payments-prod-wif   # WIF service connection
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az group list --output table

Azure DevOps can convert existing secret-based ARM connections to WIF in place, and it surfaces a warning when a connection still uses an expiring secret. Treat that warning as a backlog item.

Enterprise scenario

A payments platform team ran ~40 microservices, each extending a shared stages/deploy.yml@templates pinned to refs/tags/v3. Prod was protected by a four-eyes approval and an exclusive lock. The incident: a Sev-2 outage needed a hotfix shipped in minutes, but the exclusive lock was held by a long-running canary from an unrelated service that was parked in postRouteTraffic waiting on its Azure Monitor gate. The hotfix run sat queued behind it. Worse, the parked run could not be approved-through because the on-call engineer wasn’t in the approver group — that was the four-eyes design working exactly as intended, against us.

The root cause was scoping the exclusive lock at the whole prod Environment instead of per service. We re-modeled each service as a resource inside the Environment (prod.payments-api, prod.ledger) and moved the lock to the resource via the runOnce deployment targeting that resource, so locks no longer cross service boundaries:

      - deployment: deploy
        environment: prod.payments-api   # per-service resource, lock is scoped here
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: app
                - script: ./deploy.sh payments-api

We also added a dedicated break-glass pipeline that extends the same governance template but targets a separate prod-hotfix Environment whose approval group includes on-call, with the Azure Monitor gate kept but business-hours dropped. Lesson: an exclusive lock and a global four-eyes approval are correct defaults, but if they share a single blast radius they will eventually serialize an emergency behind a routine deploy. Scope locks to the unit you actually deploy, and pre-build the break-glass path before you need it.

Verify

Run the pipeline from main and confirm each guardrail actually engages:

# Trigger a run and capture its id
runId=$(az pipelines run \
  --name payments-cd \
  --branch main \
  --organization https://dev.azure.com/contoso \
  --project Payments \
  --query id --output tsv)

# Inspect the overall run state; it should not report "completed" while prod awaits approval
az pipelines runs show \
  --id "$runId" \
  --organization https://dev.azure.com/contoso \
  --project Payments \
  --query "{status:status, result:result}"

What “correct” looks like:

Production checklist

Pitfalls

Next, layer in a canary or rolling strategy on the prod deployment job and wire its postRouteTraffic hook to the same Azure Monitor gate — so a bad canary rolls itself back before it ever reaches the full fleet.
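
As a starting point, here is a sketch of what that canary job might look like. It assumes the Kubernetes resource prod.payments-ns from Step 2, illustrative increments of 10 and 50 percent, and the smoke-check.sh script used earlier as the health probe. The probe here is an in-pipeline script rather than the Environment's Azure Monitor check, so treat the wiring as an assumption to adapt.

```yaml
      - deployment: deploy
        environment: prod.payments-ns          # Kubernetes resource in the prod Environment
        strategy:
          canary:
            increments: [10, 50]               # illustrative percentages
            deploy:
              steps:
                - checkout: self               # manifests live in the repo
                - task: KubernetesManifest@1
                  inputs:
                    action: deploy
                    strategy: canary
                    percentage: $(strategy.increment)
                    manifests: manifests/deployment.yaml
            postRouteTraffic:
              steps:
                - checkout: self
                - script: ./scripts/smoke-check.sh   # a failing probe triggers on.failure
            on:
              failure:
                steps:
                  - checkout: self
                  - task: KubernetesManifest@1
                    inputs:
                      action: reject           # remove the canary workloads
                      strategy: canary
                      manifests: manifests/deployment.yaml
```

The on.failure hook is where the rollback lives, so a probe that fails after any increment removes the canary instead of promoting it.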
