
Azure Landing Zone: Platform Automation & DevOps — IaC with Bicep & Terraform, the ALZ Accelerator, Subscription Vending, Platform CI/CD & GitOps

Where this fits

Platform Automation & DevOps is the eighth and final design area of the Azure Landing Zone (ALZ) conceptual architecture, and it is the one that makes the other seven durable. Identity, Network Topology, Resource Organization, Security, Management, Governance and the Billing/Entra-tenant decisions all describe a desired state; Platform Automation & DevOps decides how that state is expressed as code, reviewed, tested, deployed, and continuously reconciled so it never drifts. Concretely, this design area answers: in what language is the platform defined (Bicep, Terraform, or both)? Which accelerator bootstraps the management-group tree, policies and platform subscriptions? How does an application team get a governed subscription on demand (subscription vending)? What CI/CD deploys changes to the platform itself, with what gates and what identity? And what operating model does the platform team run so that “the landing zone” is a product with versioned releases rather than a one-off click-ops build that decays the day after go-live? Skip this design area and you have a beautiful architecture that is true exactly once; do it well and every other design area becomes a pull request.

Azure Landing Zone Design Areas — animated overview

Infrastructure as Code with Bicep and Terraform

What it is

Infrastructure as Code (IaC) is the practice of declaring your Azure resources — management groups, policy assignments, subscriptions, hub networks, Log Analytics workspaces, the lot — in version-controlled, machine-readable definitions that a deployment engine reconciles against the live tenant. For Azure landing zones there are two first-class, Microsoft-supported choices, and the ALZ ships both as maintained reference implementations:

Why it matters

The landing zone is governance infrastructure: a Deny policy assigned at the intermediate-root management group applies to every subscription beneath it, which in a standard ALZ hierarchy is effectively the whole estate. You cannot afford for that to be a hand-edited portal setting that one admin remembers. IaC gives you review (PR + approvals), audit (git history of who changed which guardrail and why), repeatability (rebuild the platform in a DR tenant or a second geo), testability (validate and what-if before apply), and drift detection (the live tenant is continuously compared to the declared state). It is also the only sane way to operate at scale: hundreds of policy assignments and dozens of subscriptions and spoke networks are not maintainable by clicking.
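Drift detection, the last item in that list, reduces to a three-way comparison between what git declares and what the tenant actually contains. The Python below is a minimal sketch, assuming both sides have already been flattened into dictionaries keyed by resource ID; in a real pipeline terraform plan or a what-if run performs this comparison for you, and the policy-assignment ID in the example is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Resource ID -> declared or observed properties (a simplified, flattened view).
State = Dict[str, Dict[str, Any]]


def detect_drift(declared: State, live: State) -> List[Tuple[str, str]]:
    """Classify every resource that differs between git (declared) and the tenant (live)."""
    findings: List[Tuple[str, str]] = []
    for rid in sorted(declared.keys() | live.keys()):
        if rid not in live:
            findings.append((rid, "missing"))    # in git, but deleted or never deployed
        elif rid not in declared:
            findings.append((rid, "unmanaged"))  # exists in the tenant, unknown to git
        elif declared[rid] != live[rid]:
            findings.append((rid, "changed"))    # edited out of band, e.g. in the portal
    return findings


if __name__ == "__main__":
    # Hypothetical policy assignment at a hypothetical "alz" management group.
    pa = ("/providers/Microsoft.Management/managementGroups/alz"
          "/providers/Microsoft.Authorization/policyAssignments/deny-public-ip")
    declared = {pa: {"enforcementMode": "Default"}}
    live = {pa: {"enforcementMode": "DoNotEnforce"}}  # relaxed by hand in the portal
    for rid, kind in detect_drift(declared, live):
        print(kind, rid)
```

Unmanaged resources are reported rather than deleted: something git does not know about may belong to another team, so deciding what to do with it is a human call.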

Bicep vs Terraform for the platform — the real decision

This is the single most consequential decision in this design area. The honest answer: both are fully supported and both deploy the same ALZ reference architecture; choose on team skills and existing estate, not on a feature scoreboard.

Dimension | Bicep | Terraform
Vendor / support | Microsoft-native, installed with the Azure CLI; ARM is the engine | HashiCorp tool with Microsoft-published modules; AzureRM/AzAPI providers
State management | No external state; ARM records deployments at the target scope, so there is nothing to secure or lock yourself | Explicit state file you must host (Azure Storage backend) and lock (blob lease)
Multi-cloud | Azure only | Cloud-agnostic; preferred if you also run AWS/GCP with one toolchain
Day-0 coverage of new Azure features | Often same-day (ARM-native) | Lags until the AzureRM provider adds it; the AzAPI provider closes the gap
Preview before apply | az deployment ... what-if | terraform plan (clearer diff, the gold standard)
Drift handling | What-if on redeploy; ARM deployments are idempotent | plan shows drift; existing resources can be imported
Modularity | Modules + template specs; private registry via ACR, or the Bicep public registry (which also hosts Azure Verified Modules) | Modules + a rich public/private registry; Azure Verified Modules (AVM)
ALZ reference impl | ALZ-Bicep | Azure/avm-ptn-alz (and the older caf-enterprise-scale module)
Best fit | Azure-only shop, .NET/Microsoft-aligned team, wants zero state ops | Multi-cloud, existing Terraform skills, wants explicit plan/state

Pragmatic guidance I give clients: pick one as the platform’s primary language and standardize. Mixing Bicep and Terraform across the same layer multiplies the skills you must hire for and the failure modes you must debug. A common, legitimate split is Terraform for the platform landing zone (because the org already runs Terraform for everything) while application teams use whatever they like inside their own subscription — the guardrails are enforced by Policy, not by mandating the app team’s IaC tool.

How to do it well

Concrete artifacts, decisions, and tools

The landing-zone accelerator

What it is

The Azure landing zone accelerator is Microsoft’s opinionated, ready-to-deploy implementation of the entire ALZ conceptual architecture — the management-group hierarchy, the platform subscriptions (Management, Connectivity, Identity), the Azure Policy initiatives and assignments, RBAC, the Log Analytics workspace, hub networking, and Defender for Cloud — bootstrapped from a single configurable starting point. It comes in three flavours that all converge on the same reference design:

Why it matters

The accelerator is the difference between spending months hand-assembling a CAF-compliant platform and standing up a known-good, Microsoft-maintained baseline in days — one that already encodes the design decisions of the other seven design areas (the canonical MG tree, the ALZ custom policy set, the hub-spoke topology). Just as importantly, it gives you a supported upgrade path: as Microsoft improves the reference (new policies, new well-architected defaults), you consume those as versioned module updates rather than re-architecting. It is the fastest correct start and the thing that keeps your platform aligned with CAF over time.
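Consuming upstream improvements as versioned module updates only works if versions are pinned, so that an upgrade is a deliberate change rather than something that happens on the next run. A small lint can enforce this. The sketch below assumes Terraform's version-constraint syntax, where a bare version or "= x.y.z" means exactly that version while operators such as "~>" and ">=" float; the module versions in the example are illustrative, not recommendations.

```python
import re
from typing import Dict, List

# Exact pins in Terraform constraint syntax: "1.2.3" or "= 1.2.3".
# Anything using ~>, >=, <, != or several clauses is treated as floating.
_EXACT_PIN = re.compile(r"^=?\s*\d+\.\d+\.\d+$")


def is_exact_pin(constraint: str) -> bool:
    return bool(_EXACT_PIN.match(constraint.strip()))


def floating_modules(modules: Dict[str, str]) -> List[str]:
    """Return the module sources whose version constraint is not an exact pin."""
    return sorted(src for src, constraint in modules.items() if not is_exact_pin(constraint))


if __name__ == "__main__":
    modules = {
        "Azure/avm-ptn-alz/azurerm": "0.11.0",  # illustrative exact pin
        "Azure/lz-vending/azurerm": "~> 4.0",   # floats across minor releases
    }
    print(floating_modules(modules))
```

Pre-release versions (for example "1.0.0-rc1") would be reported as floating too; widen the pattern if you deliberately pin those.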

Bootstrap vs platform — the key mental model

The Terraform accelerator draws a deliberate line you should preserve:

Phase | What it creates | Run how often
Bootstrap | The DevOps scaffolding: the git repos, the CI/CD pipelines/workflows, the deployment identity (a Microsoft Entra app/managed identity with federated credentials / OIDC), the Terraform state storage account, and the agent/runner config | Once, to set up the factory
Platform (continuous) | The actual landing zone: MGs, policies, platform subs, hub network, Log Analytics, deployed by the pipelines the bootstrap created | Every change, forever, via PRs

Internalising this prevents the classic mistake of treating the accelerator as a one-shot installer. The accelerator’s real value is that the platform phase is now a CI/CD loop you own.
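The deployment identity created during bootstrap authenticates from CI through Microsoft Entra workload identity federation: each federated credential trusts one issuer, one audience, and one exact subject claim. For GitHub Actions the issuer is https://token.actions.githubusercontent.com, the audience Azure expects by default is api://AzureADTokenExchange, and a job bound to a GitHub environment presents the subject repo:<org>/<repo>:environment:<name>. The sketch below only computes those credential definitions and calls no Azure API; the organisation, repository, and environment names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
AZURE_AUDIENCE = "api://AzureADTokenExchange"


@dataclass(frozen=True)
class FederatedCredential:
    name: str
    issuer: str
    subject: str
    audiences: Tuple[str, ...]


def environment_credentials(org: str, repo: str, environments: List[str]) -> List[FederatedCredential]:
    """One federated credential per GitHub environment, so only jobs running in that
    environment (and therefore behind its protection rules) can obtain a token."""
    return [
        FederatedCredential(
            name=f"{repo}-{env}",
            issuer=GITHUB_ISSUER,
            subject=f"repo:{org}/{repo}:environment:{env}",
            audiences=(AZURE_AUDIENCE,),
        )
        for env in environments
    ]


if __name__ == "__main__":
    # Hypothetical names: a read-only "plan" environment and an approval-gated "apply" one.
    for cred in environment_credentials("contoso", "alz-platform", ["plan", "apply"]):
        print(cred.subject)
```

Binding credentials to environments rather than branches places the environment's protection rules (required reviewers) between a merged PR and a token that can change the tenant; giving plan and apply separate identities also lets the plan identity stay read-only.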

How to do it well

Concrete artifacts, decisions, and tools

Subscription vending

What it is

Subscription vending is the automated, self-service flow by which an application team receives a fully governed, network-connected, policy-compliant application landing zone subscription on demand — without a central team hand-building it. It is the productized answer to “how do app teams get into the platform,” and it is the practical embodiment of the ALZ principle of democratization (subscriptions are a unit of scale you mint freely, inside guardrails, rather than a precious thing gated by tickets). The reference implementation is the Azure/lz-vending Terraform module (with Bicep equivalents); it is the natural companion to the platform accelerator.

Why it matters

The subscription is Azure’s primary boundary for scale (quotas/limits), cost, RBAC/policy inheritance, and blast radius. If creating one is a manual, multi-week ticket, you get two failure modes: cloud velocity dies, and teams route around you (shadow IT, everything crammed into one mega-subscription that hits limits and muddies cost). Vending fixes both: app teams move at their own pace, and the platform team governs by policy-as-code applied at birth — the subscription is compliant the instant it exists because it lands under a governed management group with the right policies inherited. It is also where the platform’s self-service contract lives.

What a vending request actually does

A good vending pipeline turns a small metadata request into a complete, compliant subscription. The steps the lz-vending module orchestrates:

Step | What happens | Why
Create / reference subscription | Mint a new subscription under the right billing scope (EA enrollment account or MCA invoice section), or alias an existing one | Correct cost lineage from day one
Management-group placement | Place it in the correct MG (e.g. Corp / Online) | Inherits the right guardrails immediately
RBAC assignment | Grant the app team Owner/Contributor scoped to their subscription only | Self-service without platform access
Networking | Create a spoke vNet and peer it to the regional hub (or connect to Virtual WAN); register Private DNS | Connected and resolvable, hub-routed
Budget + tags + alerts | Apply a Cost Management budget with action-group alerts and mandatory tags (cost center, owner, env, classification) | Cost control and showback at birth
Resource providers + baseline | Register required resource providers; seed baseline RGs/Key Vault/identities | Ready to deploy into
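Most of the mapping from a vending request to these steps is deterministic, which is why it automates well. The sketch below shows that mapping in Python under stated assumptions: the request fields, the management-group names corp and online, and the rule that only Corp spokes are peered to the hub are hypothetical choices for illustration, while the billing-scope string follows the resource-ID format of an MCA invoice section. In practice the lz-vending module takes equivalent values as inputs and performs the steps itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

ALLOWED_CONNECTIVITY = {"corp", "online"}  # hypothetical MG names under the landing-zones MG


@dataclass(frozen=True)
class VendingRequest:
    app_name: str
    environment: str
    cost_center: str
    owner: str
    classification: str
    connectivity: str       # "corp" (hub-connected) or "online"
    monthly_budget: float   # in the billing currency


def mca_invoice_section_scope(account: str, profile: str, section: str) -> str:
    return (f"/providers/Microsoft.Billing/billingAccounts/{account}"
            f"/billingProfiles/{profile}/invoiceSections/{section}")


def plan_vending(req: VendingRequest, billing_scope: str,
                 alert_percentages: Sequence[int] = (60, 90, 100)) -> Dict[str, object]:
    """Turn a validated request into the values a vending pipeline would apply."""
    if req.connectivity not in ALLOWED_CONNECTIVITY:
        raise ValueError(f"unknown connectivity {req.connectivity!r}")
    if req.monthly_budget <= 0:
        raise ValueError("budget must be positive")
    alert_amounts: List[float] = [round(req.monthly_budget * p / 100, 2) for p in alert_percentages]
    return {
        "subscription_alias": f"{req.app_name}-{req.environment}",
        "billing_scope": billing_scope,
        "management_group": req.connectivity,        # placement drives inherited policy
        "peer_spoke_to_hub": req.connectivity == "corp",
        "owner_role_scope": "subscription",          # never an MG or the tenant root
        "tags": {"cost_center": req.cost_center, "owner": req.owner,
                 "env": req.environment, "classification": req.classification},
        "budget_alert_amounts": alert_amounts,
    }
```

Rejecting an unknown connectivity value up front matters more than it looks: a typo in the placement field would otherwise put a subscription under the wrong guardrails.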

How to do it well

Concrete artifacts, decisions, and tools

CI/CD for the platform

What it is

Platform CI/CD is the pipeline that takes a change to the landing zone itself — a new policy assignment, an MG tweak, a hub-network change, a new vended subscription — from a pull request through validation, preview, approval, and deployment, with the right identity, gates, and environment promotion. It is distinct from application CI/CD (which deploys app code into a landing zone). Here the “artifact” being deployed is governance and platform infrastructure, and the blast radius is potentially tenant-wide, so the controls are correspondingly stricter.

Why it matters

Because a single merged PR can change a Deny policy across every subscription, the platform pipeline is the control plane for your control plane. Done well it gives you peer-reviewed, auditable, gated, repeatable changes with a clear preview of impact and the ability to promote the same change through canary → production. Done badly (or not at all) you are back to admins clicking in the portal with no review, no preview, and no record — exactly the state IaC exists to eliminate.

The platform pipeline stages

A reference platform pipeline (whether GitHub Actions or Azure DevOps Pipelines) runs roughly:

Stage | What runs | Gate
Validate / lint | terraform validate / bicep build, fmt/format checks, tflint, naming-convention checks (e.g. names generated with the Azure/naming module) | Block on failure
Static / policy test | Checkov/tfsec/PSRule for Azure (well-architected + ALZ rules), secret scanning | Block on findings
Plan / what-if | terraform plan / az deployment ... what-if, output attached to the PR | Human review of the diff
PR approval | Required reviewers (CODEOWNERS), branch protection | Approvals + green checks
Apply to canary | Deploy to a non-prod MG/management-test scope or canary tenant | Automated post-deploy checks
Apply to production | Deploy to the real MG hierarchy / platform subs | Environment protection / manual approval
Post-deploy verify | Azure Resource Graph compliance queries, policy-compliance check | Alert on drift
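Attaching the plan to the PR is most useful when the pipeline also summarises it and escalates destructive changes. The sketch below assumes the JSON produced by terraform show -json on a saved plan file, whose resource_changes entries carry address, type, and a change.actions list (for example ["create"], ["update"], ["delete", "create"] for a replacement, or ["no-op"]). The rule that deleting or replacing a management group or MG-level policy assignment needs an extra approval is an illustrative policy, not an accelerator default.

```python
import json
from collections import Counter
from typing import List, Tuple

# Illustrative policy: deleting or replacing these resource types needs an extra approval.
SENSITIVE_TYPES = {
    "azurerm_management_group",
    "azurerm_management_group_policy_assignment",
}


def summarize_plan(plan_json: str) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count planned actions and collect (address, type) for every delete or replacement."""
    plan = json.loads(plan_json)
    counts: Counter = Counter()
    destructive: List[Tuple[str, str]] = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will change in the tenant
        if "delete" in actions:
            # ["delete"] is a destroy; ["delete", "create"] or ["create", "delete"] is a replacement
            counts["replace" if "create" in actions else "delete"] += 1
            destructive.append((rc["address"], rc["type"]))
        else:
            counts[actions[0]] += 1  # "create" or "update"
    return counts, destructive


def needs_extra_approval(destructive: List[Tuple[str, str]]) -> bool:
    return any(rtype in SENSITIVE_TYPES for _, rtype in destructive)
```

In a workflow you would pass it the output of terraform show -json for the saved plan, post the counts as the PR comment, and fail the job (or require a second approver) when needs_extra_approval returns True.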

How to do it well

Concrete artifacts, decisions, and tools

GitOps and the platform team operating model

What it is

GitOps is the operating discipline where git is the single source of truth for desired state, all changes flow through pull requests, and an automated system continuously reconciles the live environment to what git says — with drift detected and (optionally) auto-corrected. The platform team operating model is the human side: a dedicated platform engineering team (often the Cloud Center of Excellence / CCoE) that builds and operates “the landing zone” as an internal product — versioned, documented, supported, with application teams as its customers. Together they turn the platform from a project into a product with a release cadence.

Why it matters

Two failure modes kill landing zones over time: drift (someone clicks a change in the portal and the declared state silently diverges) and ownership rot (the original build team disbands and no one owns the platform’s evolution). GitOps attacks drift by making git authoritative and reconciliation continuous. The platform-team operating model attacks ownership rot by making the landing zone someone’s product with a backlog, releases, and SLOs. The combination is what lets a 50-team enterprise keep a single coherent, compliant platform years after go-live — and it shifts the platform team from gatekeepers (manually approving every request) to enablers (publishing self-service capabilities behind guardrails).

The operating model — platform team vs application teams

Concern | Platform team (CCoE) owns | Application team owns
Management-group hierarchy & policy | ✔ (as code, via the platform repo) |
Connectivity hub, firewall, DNS | ✔ | Spoke usage within guardrails
Subscription vending capability | ✔ (builds the vending product) | Requests a subscription via PR
Guardrails (Azure Policy) | ✔ (authors/assigns) | Operates inside them
Their application + its resources | | ✔ (deploys into their own sub)
RBAC for their workload | | ✔ (scoped to their sub)
Golden modules / paved road | ✔ (publishes AVM-based modules) | ✔ (consumes them)

The principle: the platform team delegates with guardrails, not gatekeeps. App teams get Owner on their own subscription and a paved road of approved, reusable modules; the platform team ensures the guardrails they cannot remove are correct.

How to do it well

KPIs for the platform product

KPI | What it tells you | Healthy direction
Subscription vending lead time | Self-service health | Minutes, not weeks
Policy-change lead time (PR → enforced) | Governance agility | Hours/days
Platform deployment frequency | How safely you can change the platform | Frequent, small PRs
Change failure rate / MTTR | Platform-change safety | Low / fast
% platform changes via git (vs portal) | GitOps adherence | ~100%
Policy-compliance % across the estate | Guardrails actually holding | High and rising
Configuration-drift incidents detected | Reconciliation working | Detected fast, trending down
Paved-road module adoption | Is the easy path the compliant path? | Rising
Paved-road module adoption Is the easy path the compliant path? Rising

Concrete artifacts, decisions, and tools

Real-world enterprise scenario

Aranya Logistics is a fictional pan-India freight and last-mile-delivery company — ~2,300 employees, 38 application teams, a data-residency obligation (India regions only for customer data), and a board mandate to move off a chaotic single-subscription Azure estate onto a governed platform within two quarters. Their newly-formed platform engineering team (six engineers, branded internally as the CCoE) owns the Platform Automation & DevOps design area end to end.

IaC with Bicep and Terraform. Aranya already runs Terraform for their on-premises estate and a small AWS analytics footprint, so they standardize the platform on Terraform + Azure Verified Modules, building the landing zone from Azure/avm-ptn-alz and vending from Azure/lz-vending. Application teams are not forced onto Terraform: guardrails are enforced by Policy, so app teams use Bicep or Terraform as they prefer inside their own subscriptions. Platform state lives in an Azure Storage account in the Management subscription with versioning, soft-delete, Entra-auth (no account keys), blob-lease locking, a customer-managed key, and a CanNotDelete lock. Every upstream module is pinned to an exact version; upgrades are deliberate PRs.
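Because the state file maps every platform resource and can contain sensitive values in plain text, a hardening list like Aranya's is worth checking automatically. A minimal sketch, assuming a hypothetical flattened dictionary of the storage account's settings (the keys are illustrative, not ARM property names); blob-lease locking is not listed because the azurerm backend takes the lease itself while it works on the state.

```python
from typing import Dict, List

# Expected value for each (hypothetical, flattened) setting on the state storage account.
REQUIRED_SETTINGS: Dict[str, bool] = {
    "blob_versioning": True,       # recover earlier versions of the state
    "blob_soft_delete": True,      # recover a deleted state blob
    "shared_key_access": False,    # Entra ID auth only, no account keys
    "customer_managed_key": True,  # encryption with a key the platform team controls
    "cannot_delete_lock": True,    # CanNotDelete resource lock on the account
}


def state_backend_findings(settings: Dict[str, bool]) -> List[str]:
    """Return the settings that are missing or differ from the expected value."""
    return [key for key, expected in REQUIRED_SETTINGS.items() if settings.get(key) != expected]
```

A missing setting is treated as non-compliant, so an incomplete export fails closed rather than passing silently.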

The landing-zone accelerator. They bootstrap with the Azure/alz Terraform accelerator, choosing the hub-and-spoke “complete” starter (matching their Network Topology decision: Azure Firewall + ExpressRoute hub, not Virtual WAN). The bootstrap phase (run once) creates the GitHub repos, the GitHub Actions workflows, the Terraform state backend, and a Microsoft Entra app with OIDC federated credentials — no client secrets anywhere. The platform phase then runs continuously via PRs. On top of the accelerator’s defaults, Aranya layers org specifics in a config overlay: two extra geo MG tiers (Corp-IN-Central, Corp-IN-South) carrying an Allowed locations policy pinned to centralindia/southindia, plus three custom policy initiatives. They stand the whole platform up first in a management-test MG branch and prove a clean plan before touching production MGs.
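The geo tiers work because Azure Policy assignments are inherited down the management-group tree and a resource must satisfy every Deny assignment on its ancestor path, so two Allowed locations assignments combine as the intersection of their lists. The sketch below models that with a hypothetical tree and hypothetical allow-lists (assuming, for illustration, that the intermediate root allows both Indian regions and each geo tier narrows to one); it is a model of inheritance, not a policy engine.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical MG tree: child -> parent (None marks the intermediate root).
PARENT: Dict[str, Optional[str]] = {
    "aranya": None,
    "landing-zones": "aranya",
    "corp": "landing-zones",
    "corp-in-central": "corp",
    "corp-in-south": "corp",
    "online": "landing-zones",
}

# Hypothetical Allowed locations assignments per MG.
ALLOWED_LOCATIONS: Dict[str, FrozenSet[str]] = {
    "aranya": frozenset({"centralindia", "southindia"}),
    "corp-in-central": frozenset({"centralindia"}),
    "corp-in-south": frozenset({"southindia"}),
}


def effective_allowed_locations(mg: str) -> Optional[FrozenSet[str]]:
    """Intersect every Allowed locations list from mg up to the root.

    Returns None when no assignment exists on the path (no location restriction)."""
    result: Optional[FrozenSet[str]] = None
    node: Optional[str] = mg
    while node is not None:
        allowed = ALLOWED_LOCATIONS.get(node)
        if allowed is not None:
            result = allowed if result is None else result & allowed
        node = PARENT[node]
    return result
```

The online case shows the flip side: a subscription inherits only what sits on its own ancestor chain, so a residency rule has to live at or above every MG that vending can place a subscription into.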

Subscription vending. They productize onboarding with lz-vending. An app team opens a pull request to the landing-zones repo containing a ~15-line YAML: app name, cost center, environment, criticality, residency (IN-Central/IN-South), budget, and connectivity. PR review by the platform team is the approval gate; merge triggers a GitHub Actions workflow that mints the subscription under their MCA invoice section, places it in Corp-IN-Central, Corp-IN-South or Online according to the request, peers its spoke vNet to the regional hub, registers Private DNS, grants the app team Owner scoped to that subscription only (platform elevation stays behind Entra PIM), seeds four baseline RGs (-app/-data/-net/-shared), wires diagnostic settings to the central Log Analytics workspace, and sets a ₹-denominated Cost Management budget with action-group alerts at 60/90/100%. A previously three-week, ticket-driven subscription request now completes in under 18 minutes, fully compliant.

Platform CI/CD. All platform changes run through GitHub Actions authenticating via OIDC workload identity federation (zero stored secrets). Each layer has its own scoped deployment identity (MG/policy, connectivity, vending); there is no tenant-root god-identity. The pipeline lints (terraform validate, tflint, and naming-convention checks built on the Azure/naming module), runs PSRule for Azure and Checkov as blocking gates, posts the terraform plan as a PR comment for human review, requires CODEOWNERS approval and branch protection on main, applies first to the management-test canary scope with post-deploy Azure Resource Graph checks, then applies to production behind a GitHub Environment manual approval. A nightly plan flags any out-of-band drift.
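Per-layer identities only limit blast radius if their role assignments stay inside each layer's boundary, which is easy to verify from an export of role assignments. The sketch below assumes a hypothetical mapping from each deployment identity to the scopes it holds roles at, and a second mapping to the boundary scope it should stay within; Azure resource IDs are compared case-insensitively, and "/" denotes the root scope.

```python
from typing import Dict, List, Tuple


def within(scope: str, boundary: str) -> bool:
    """True if scope equals boundary or is nested beneath it in the resource-ID path."""
    s, b = scope.lower().rstrip("/"), boundary.lower().rstrip("/")
    return s == b or s.startswith(b + "/")


def scope_violations(assignments: Dict[str, List[str]],
                     boundaries: Dict[str, str]) -> List[Tuple[str, str]]:
    """List (identity, scope) pairs where a role assignment escapes the identity's boundary."""
    violations: List[Tuple[str, str]] = []
    for identity, scopes in assignments.items():
        boundary = boundaries[identity]
        for scope in scopes:
            if not within(scope, boundary):
                violations.append((identity, scope))
    return violations
```

This is a string-path check on where each assignment is made. Subscription IDs do not nest under management-group IDs as strings, so an identity whose boundary is an MG would have subscription-scoped assignments flagged; handling that case needs the MG tree.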

GitOps and the platform team operating model. Aranya runs the landing zone as an internal product: semantically versioned releases, a changelog, a backlog fed by app-team guardrail requests (submitted as PRs/issues), and published SLOs (vending < 30 min; policy-change lead time < 2 days). No portal changes to platform scopes are permitted — portal write access to platform MGs is locked down, so the pipeline is the only path; the nightly reconciliation alerts on drift and the team remediates via PR. They publish a paved road of AVM-based golden modules (a compliant App Service web app, an AKS baseline using the Flux AKS GitOps extension for workload reconciliation, and a data-platform module) so the easy path is the compliant path. A Power-BI-over-Resource-Graph KPI dashboard tracks vending lead time, policy-compliance %, % of changes via git, and drift incidents.

Measurable outcome after two quarters: 41 application landing zones vended (target 35); ~99% reduction in subscription provisioning lead time (3 weeks → 18 min); 100% of platform changes via git (portal writes to platform scopes disabled); 0 stored deployment secrets (all OIDC); policy-compliance across the new estate at 98% (live via Resource Graph); zero residency violations (the Allowed locations policy pinned at the geo MG tier is inherited everywhere beneath); and paved-road module adoption at 73% of new workloads — the landing zone now evolves through reviewed pull requests rather than tribal knowledge.

Deliverables & checklist

Common pitfalls

  1. Treating the accelerator as a one-shot installer. Running the portal accelerator once and walking away leaves you with a platform that immediately starts drifting and has no upgrade path. Keep the bootstrap-vs-platform distinction: the accelerator builds the factory; the platform must then be a continuous CI/CD loop you own, consuming upstream releases as versioned updates.
  2. Mixing Bicep and Terraform in the same layer without a reason. Each tool multiplies the skills you must hire for and the failure modes you must debug. Standardize one primary language for the platform; if app teams want a different tool inside their own subscription, that’s fine — guardrails are enforced by Policy, not by mandating their IaC tool.
  3. Storing pipeline secrets instead of using OIDC. A long-lived client secret in your CI system is a tenant-wide credential waiting to leak. Use Microsoft Entra workload identity federation (OIDC) so deployments use short-lived tokens with no stored secret — the accelerator wires this for you.
  4. One god deployment identity with Owner on the tenant root. A single over-privileged identity is a catastrophic blast radius and a compliance red flag. Use least-privilege, per-layer deployment identities scoped to the smallest MG/subscription each needs, with elevation via PIM where required.
  5. Manual portal changes to the platform (drift). The moment someone clicks a policy or MG change in the portal, git is no longer the truth and your reconciliation lies. Enforce GitOps — disable portal write access to platform scopes, route every change through a PR, and run scheduled drift detection to catch exceptions.
  6. Centralized, ticket-driven subscription creation. Hand-building subscriptions kills velocity and breeds shadow IT. Productize vending (Azure/lz-vending) so app teams self-serve via a small PR and receive a governed, connected, budgeted subscription in minutes — with no standing platform access.
  7. Building the platform but never running it as a product. Without an owning platform team, a backlog, versioned releases and SLOs, the landing zone decays into tribal knowledge the day the build team disbands. Run it as an internal product with a clear operating model and a paved road, measured by platform KPIs.

What’s next

This installment closes the eight Azure Landing Zone Design Areas — from here, the next part of the series steps back to the operating cadence that keeps the whole platform healthy: landing-zone lifecycle management and continuous compliance, turning the design areas you’ve now built into an ongoing, measurable practice.

Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.



// part 8 of 8 · Azure Landing Zone Design Areas
