Azure Enterprise-Scale Landing Zone: Foundation for Large Organizations

A global logistics company migrated to Azure region by region, team by team. Each squad invented its own naming, its own networking, its own idea of “secure.” Three years later they had forty-one subscriptions with overlapping 10.0.0.0/16 ranges that could never be peered, four different log destinations (and three teams with none), no central way to forbid public SQL, and a security team that found out about new internet-facing workloads from the threat-intel feed rather than from a change record. Re-IP-ing production to undo the address collisions took two quarters. None of this was a tooling failure — every team used Azure correctly in isolation. It was the absence of a foundation: a shared, opinionated scaffolding that every workload lands on so the basics are decided once, centrally, and inherited automatically.

That foundation is the Azure enterprise-scale landing zone (ESLZ) — Microsoft’s prescriptive architecture, part of the Cloud Adoption Framework (CAF), for running Azure at organizational scale. It is emphatically not “a hub VNet and some policies.” It is a complete operating model expressed as Azure resources: a management-group hierarchy that scopes policy and access top-down; a split between platform subscriptions (identity, management, connectivity) that the central team owns and application landing-zone subscriptions that workload teams own; a hub-and-spoke or Virtual WAN network topology with centralized egress and DNS; Azure Policy assignments that enforce the guardrails (allowed regions, required tags, deny public endpoints, force diagnostic settings) so compliance is automatic rather than audited-after-the-fact; centralized logging into a single Log Analytics workspace; and a subscription-vending process that hands a new team a fully-governed subscription in hours, not weeks. The point of all of it is subsidiarity: the platform team decides the things that must be consistent (security, connectivity, identity), and application teams move fast inside guardrails on everything else.

This article walks the architecture the way you would actually build and operate it. You will learn what each management-group tier is for and why aligning it to org charts is the classic mistake; the exact role of each platform subscription; how the connectivity hub centralizes the firewall, gateways, and private DNS; the difference between policy effects (Deny, Audit, DeployIfNotExists, Modify) and which guardrail uses which; how subscription vending works and what it provisions; and where landing zones go wrong (mis-scoped policy that blocks every deployment, a platform team that becomes a ticket queue, a hierarchy so deep nobody can reason about effective access). Every concept comes with the az and Bicep to implement it, the real limits that constrain the design, and — because this is a reference you will keep open during a platform build — the decisions, options, effects, and failure modes are laid out as scannable tables. AZ-305 and AZ-104 both test this material heavily; so does every architecture-review board you will ever sit in front of.

What problem this solves

Resource groups and a subscription get one team to production. They do not get fifty teams to production without the environment collapsing into entropy. The pains a landing zone exists to kill are specific and they all stem from decentralized defaults:

Networking that can never be joined. Independent teams pick 10.0.0.0/16 because the portal suggests it. Two such VNets can never peer (overlapping address space), so the moment two workloads must talk you are re-IP-ing production or bolting on NAT. A landing zone hands every spoke a non-overlapping CIDR from a planned IP plan and peers it to a hub on day one.

Governance you can only audit, never enforce. Without central policy, “no public storage accounts” is a wiki page that someone violates on a Friday. The breach is found in a quarterly review, weeks late. With policy assigned at a management group, a Deny effect makes the non-compliant PUT fail at the control plane — the bad config never exists.

Logs scattered or absent. Each team wires (or forgets) diagnostic settings to a workspace of its choosing. The SOC has no single place to hunt, and an incident spanning three teams means three queries against three schemas. A landing zone forces diagnostic settings to a central workspace via DeployIfNotExists policy, so coverage is automatic and complete.

Identity sprawl and over-privilege. Without a model, every team grants Owner at subscription scope “to be safe,” and nobody can answer “who can delete production?” A landing zone defines RBAC at management-group scope with least privilege, uses managed identities over secrets, and gates standing privilege behind PIM.

Onboarding measured in weeks. A new business unit asking for Azure waits while someone hand-builds a subscription, wires networking, sets up logging, and applies security. Multiply by every new team and the central team is the bottleneck. Subscription vending turns that into a templated, hours-long, fully-governed hand-off.

Who hits this: any organization past roughly five to ten subscriptions, anyone with regulated workloads (a Deny-non-compliant-regions guardrail is the cheapest path to data-residency compliance), and anyone on a multi-year cloud journey where the foundation must outlive the first three projects. Who does not: a startup with one subscription and one team — for them the ESLZ overhead is pure cost with no payoff, and “minimum viable landing zone” or nothing is the right call.

To frame the whole design before the deep dive, here is the foundation as five pillars, the pain each removes, and the Azure construct that delivers it:

| Pillar | Pain without it | Delivered by | Enforcement mechanism |
|---|---|---|---|
| Resource organization | 40 ungoverned subscriptions, no hierarchy | Management-group tree | Inheritance of policy + RBAC top-down |
| Governance | “No public SQL” is a wiki page | Azure Policy assigned at MG scope | Deny / Audit / DeployIfNotExists effects |
| Network topology | Overlapping CIDRs, can’t peer | Hub-and-spoke or Virtual WAN | Connectivity subscription + IP plan + peering |
| Identity & access | Everyone is Owner; no audit trail | Entra ID + RBAC at MG scope + PIM | Least-privilege role assignments, JIT elevation |
| Operations | Logs scattered or missing | Central Log Analytics + Defender | DeployIfNotExists diagnostic-setting policy |

Learning objectives

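By the end of this article you can:

Explain what each management-group tier is for and why modeling the org chart is the classic mistake.
Describe the role of each platform subscription (Identity, Management, Connectivity).
Describe how the connectivity hub centralizes the firewall, gateways, and private DNS.
Choose between the policy effects (Deny, Audit, DeployIfNotExists, Modify) for a given guardrail.
Explain how subscription vending works and what it provisions.
Recognize where landing zones go wrong: mis-scoped policy, a platform team that becomes a ticket queue, a hierarchy too deep to reason about.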

Prerequisites & where this fits

You should already understand the Azure resource hierarchy — that resources live in resource groups, resource groups live in subscriptions, and subscriptions can be grouped under management groups — and that RBAC and Azure Policy both inherit downward through that hierarchy. If that hierarchy is fuzzy, read Azure Resource Hierarchy Explained first; it is the literal substrate this article builds on. You should be able to run az in Cloud Shell, read JSON output, and know what a managed identity and a service principal are. Familiarity with VNets, subnets, peering, and NSGs helps for the connectivity sections — Azure Virtual Network: Subnets & NSGs covers the fundamentals.

This sits at the very top of the Governance & Platform track. Everything else assembles inside it: Azure Policy: Governance at Scale is the enforcement engine the landing zone wires up; Hub-and-Spoke vs Virtual WAN is the connectivity decision the platform team makes once; Azure Monitor & Application Insights is the observability the central workspace feeds; and Azure FinOps & Cost Management at Scale is how you keep the whole estate’s bill sane once dozens of teams are vending subscriptions. A landing zone is the frame; those are the pictures you hang in it.

A quick map of who owns what in the operating model, so the responsibility boundaries are explicit before the design:

| Layer | What lives here | Who owns it | What application teams may NOT change |
|---|---|---|---|
| Tenant root / intermediate MG | Org-wide policy, RBAC baseline | Platform / cloud CoE | The guardrail policies, the MG tree |
| Platform subscriptions | Identity, mgmt, connectivity | Platform team | Everything; they have no access here |
| Connectivity hub | Firewall, gateways, private DNS | Network team | Hub routing, firewall rules, DNS zones |
| Landing-zone MGs (Corp/Online) | Workload guardrails | Platform sets, teams inherit | Inherited deny/audit policies |
| Application subscription | The workload itself | Application team | Region allow-list, required tags, deny-public |

You do not build all of this on day one. The sane build order — what to stand up first and what to defer until real demand appears — keeps a small org from drowning in the full reference architecture:

| Build phase | What you stand up | When it’s enough | What it defers |
|---|---|---|---|
| MVLZ (minimum viable) | MG tree + core Deny guardrails + central Log Analytics | ≤ ~10 subs, no hybrid, cloud-only | The whole connectivity hub (firewall, gateways) |
| + Connectivity | Hub VNet, firewall, private DNS, spoke peering | First workloads need central egress / private PaaS | ExpressRoute, Virtual WAN, NVA chains |
| + Hybrid | VPN / ExpressRoute gateway, on-prem routes | On-prem integration required | Global any-to-any transit |
| + Vending | Templated subscription-vending module | Onboarding cadence outpaces the platform team | Per-team self-service portal |
| + Scale-out | Virtual WAN / secured hubs, more MGs, FinOps | Many regions/branches, dozens of teams | (this is the mature steady state) |

Core concepts

Six mental models make every later decision obvious.

Inheritance is the whole point — and the whole danger. Both Azure Policy and RBAC flow downward from the management group to every child MG, subscription, resource group, and resource beneath it. Assign “deny public IP on NICs” at an intermediate MG and every subscription under it inherits the deny — that is the power. But the same mechanism means a too-strict policy at a high scope silently breaks deployments three levels down in subscriptions you have never looked at. You cannot un-inherit a policy at a child scope (you can only add an exemption for a specific resource/scope). The hierarchy is a contract: what you put high is law everywhere below.

Management groups scope governance, not organization. A management group is a container for subscriptions (and other MGs) whose only job is to be a scope for policy and RBAC inheritance. The classic, expensive mistake is to model your org chart (Marketing MG, Finance MG, EMEA MG). Org charts re-org; governance needs are stable. The enterprise-scale design instead models by governance requirement: a Platform branch (consistent platform services), and Landing Zones branches like Corp (private, on-prem-connected workloads) and Online (internet-facing workloads), because those two classes need genuinely different policy (Online permits public exposure, but only behind WAF/Front Door; Corp forbids public exposure outright). Align to what must be governed differently, not who reports to whom.

Platform subscriptions are workload-free, by design. The central team runs three (or so) platform subscriptions that hold only shared services: Identity (domain controllers / Entra Domain Services, identity infra), Management (the central Log Analytics workspace, automation, monitoring), and Connectivity (the hub VNet, Azure Firewall, VPN/ExpressRoute gateways, private DNS zones). No business workload ever runs here. Keeping them workload-free means the blast radius of a workload incident never touches identity or connectivity, and the platform team can lock these subscriptions down hard.

The hub centralizes the things that must be shared. In hub-and-spoke, one hub VNet (in the Connectivity subscription) holds the resources every workload needs but nobody should duplicate: the firewall for centralized egress inspection, the gateways for hybrid connectivity, Azure Bastion for jump-box access, and private DNS zones for private-endpoint resolution. Each workload’s spoke VNet peers to the hub and routes egress through the firewall via user-defined routes (UDRs). The alternative, Virtual WAN, is a Microsoft-managed hub that does the same job with less plumbing at the cost of less control — the choice gets its own section.

Policy effects are a spectrum from “watch” to “block” to “fix.” An Azure Policy doesn’t just flag — its effect decides what happens to a non-compliant request or resource. Audit records non-compliance (no block). Deny rejects the create/update at the control plane. DeployIfNotExists remediates by deploying a missing resource (e.g. a diagnostic setting) using a managed identity. Modify mutates the request (e.g. adds a required tag). Picking the wrong effect is the difference between a guardrail that prevents incidents and one that merely logs them after the fact.

Subscription vending turns onboarding into a template. Rather than hand-build each subscription, the landing zone uses subscription vending — a templated process (Bicep/Terraform module, or the Azure Landing Zone accelerator) that creates a subscription, places it under the correct management group (so it inherits the right guardrails instantly), peers its spoke to the hub, assigns budgets and tags, and wires diagnostic settings. A new team goes from request to a fully-governed, network-connected, policy-compliant subscription in hours.
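
To make the vending sequence concrete, here is a minimal Python sketch that turns a request into the ordered steps described above. It only models the plan and calls no Azure API. The request fields, the `sub-<team>` naming, and the class-to-MG mapping are assumptions made for illustration, not part of any accelerator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VendingRequest:
    """Hypothetical request shape; field names are illustrative only."""
    team: str
    workload_class: str   # "corp" or "online"
    spoke_cidr: str       # allocated from the central IP plan
    cost_center: str
    monthly_budget: int


# Assumed mapping from workload class to the management group it lands under.
MG_FOR_CLASS = {"corp": "contoso-corp", "online": "contoso-online"}


def vending_plan(req: VendingRequest) -> List[str]:
    """Return the ordered steps a vending run would perform for this request."""
    if req.workload_class not in MG_FOR_CLASS:
        raise ValueError(f"unknown workload class: {req.workload_class}")
    mg = MG_FOR_CLASS[req.workload_class]
    return [
        f"create subscription sub-{req.team}",
        f"place under management group {mg}",
        f"create spoke VNet {req.spoke_cidr} and peer to hub",
        f"assign budget {req.monthly_budget} and tag CostCenter={req.cost_center}",
        "wire diagnostic settings to central workspace",
    ]


if __name__ == "__main__":
    request = VendingRequest("payments", "corp", "10.20.0.0/22", "CC-1042", 5000)
    for step in vending_plan(request):
        print(step)
```

In this sketch the management-group placement comes before any resource is created, so the inherited guardrails already apply when the spoke VNet is deployed.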

The vocabulary in one table

Before the deep sections, pin every moving part. The glossary repeats these for lookup; this is the model side by side:

| Term | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Management group (MG) | Container scoping policy + RBAC inheritance | Above subscriptions | The unit of governance; mis-modeling it is the core mistake |
| Tenant root group | The single MG at the top of every directory | Top of the tree | Assign org-wide guardrails here (sparingly) |
| Platform subscription | Workload-free sub for shared services | Under the Platform MG | Isolates identity/connectivity/mgmt from workloads |
| Landing-zone subscription | A workload’s home subscription | Under Corp/Online MG | Where application teams actually build |
| Hub VNet | Shared network with firewall/gateways/DNS | Connectivity subscription | Centralizes egress, hybrid, DNS |
| Spoke VNet | A workload’s VNet, peered to the hub | Application subscription | Where the workload’s compute/data lives |
| Azure Policy | Rule evaluated on resources/requests | Assigned at MG/sub/RG scope | The guardrail enforcement engine |
| Policy initiative (set) | A bundle of policies assigned together | Assigned at a scope | How you apply dozens of guardrails as one unit |
| Policy effect | What happens on non-compliance | Inside a policy definition | Deny/Audit/DeployIfNotExists/Modify |
| Subscription vending | Templated subscription provisioning | Bicep/Terraform/accelerator | Turns onboarding from weeks to hours |
| CAF / ESLZ | Microsoft’s adoption framework + reference arch | Guidance + accelerator | The blueprint this whole article implements |
| Management group level | Depth in the MG tree (max 6 below root) | The hierarchy | A hard limit that disciplines tree depth |

The management-group hierarchy

The hierarchy is the spine of the landing zone. Get it right and governance is effortless; get it wrong and you are re-parenting subscriptions and re-scoping policy for years. The enterprise-scale reference tree is deliberate, and every node earns its place.

The reference tree, tier by tier

Under the Tenant Root Group (which always exists, one per Entra tenant), the ESLZ creates a single intermediate root (often named for the company, e.g. contoso) so that org-wide policy lives one level below the true root — keeping the tenant root itself clean and letting you test changes without touching the absolute top. Beneath the intermediate root sit the major branches. Here is each node, what it is for, and what you assign there:

| MG tier | Example name | Purpose | Typical policy assigned here |
|---|---|---|---|
| Tenant Root Group | (tenant root) | The directory’s absolute top | Almost nothing; keep it clean |
| Intermediate root | contoso | Org-wide guardrails, one level down | Allowed regions, required tags, deny classic, audit baseline |
| Platform | contoso-platform | Shared-service subscriptions | Stricter network + diagnostic policy for platform |
| Platform → Identity | contoso-identity | Identity infra subscription | Lock-down, no public exposure |
| Platform → Management | contoso-management | Central logging/monitoring | Force diagnostics to central workspace |
| Platform → Connectivity | contoso-connectivity | Hub, firewall, gateways, DNS | Network guardrails, deny untrusted peering |
| Landing Zones | contoso-landingzones | Parent of all workload MGs | Workload baseline guardrails |
| Landing Zones → Corp | contoso-corp | Private, on-prem-connected workloads | Deny public inbound, require private endpoints |
| Landing Zones → Online | contoso-online | Internet-facing workloads | Allow public, require WAF/Front Door, DDoS |
| Sandbox | contoso-sandbox | Experimentation, relaxed | Loose policy, hard cost caps, no prod connectivity |
| Decommissioned | contoso-decommissioned | Subs being retired | Deny new resources, prep for deletion |

The split that confuses people most is Corp vs Online, so make it concrete. Corp workloads are reached only via private connectivity (the corporate network, ExpressRoute/VPN, private endpoints) and must never expose a public endpoint, so Corp carries a Deny on public IPs. Online workloads are meant to face the internet (a public website, a customer API), so Online allows public exposure but denies anything internet-facing that isn’t behind WAF/Front Door, and requires DDoS protection. Same parent, opposite guardrails. That is exactly why governance, not org chart, defines the tree.
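
The opposite guardrails can be sketched as a small decision function. This is a simplified Python model, not Azure Policy itself: the function name, the two boolean inputs, and the result strings are illustrative assumptions.

```python
def evaluate(branch: str, has_public_endpoint: bool, behind_waf: bool) -> str:
    """Return 'allow' or a deny reason for a simplified workload description."""
    if branch == "corp":
        # Corp: private connectivity only, so any public endpoint is rejected.
        if has_public_endpoint:
            return "deny: public endpoint in Corp"
        return "allow"
    if branch == "online":
        # Online: public exposure is expected, but only behind WAF/Front Door.
        if has_public_endpoint and not behind_waf:
            return "deny: public endpoint not behind WAF/Front Door"
        return "allow"
    raise ValueError(f"unknown branch: {branch}")


if __name__ == "__main__":
    # The same workload shape gets a different verdict in each branch.
    print(evaluate("corp", has_public_endpoint=True, behind_waf=True))
    print(evaluate("online", has_public_endpoint=True, behind_waf=True))
```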

What inherits, and the order it merges

Two subscriptions in different branches get genuinely different rule sets even though both descend from the intermediate root. The inheritance math:

| Scope | What it contributes | Override behavior |
|---|---|---|
| Intermediate root | Org-wide baseline (regions, tags) | Most restrictive wins; Deny cannot be loosened below |
| Branch MG (Platform / Landing Zones) | Branch-specific guardrails | Adds to parent; cannot remove parent’s Deny |
| Leaf MG (Corp / Online) | Class-specific rules | Adds to parent; exemptions are the only escape |
| Subscription | Sub-scoped assignments | Adds further; still cannot un-inherit |
| Resource group / resource | Finest scope | The accumulation of everything above applies |

The rule to memorize: policy is additive and Deny is sticky. A child scope can make things stricter but never looser — the only way to relax a specific resource is a policy exemption (scoped, time-boxed, audited), not a contrary assignment. RBAC inherits the same way (an assignment high up grants access everywhere below), which is why you grant narrow roles at the lowest scope that works.
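
A short Python model of that rule, assuming a hypothetical subscription `sub-payments` under Corp and made-up assignment names. It walks from a scope up to the intermediate root, unions every assignment on the way, and removes only what an explicit exemption covers.

```python
from typing import Dict, List, Optional, Set

# Hypothetical tree and assignments, reusing the MG names from this article.
PARENT: Dict[str, Optional[str]] = {
    "contoso": None,
    "contoso-landingzones": "contoso",
    "contoso-corp": "contoso-landingzones",
    "contoso-online": "contoso-landingzones",
    "sub-payments": "contoso-corp",
}
ASSIGNMENTS: Dict[str, Set[str]] = {
    "contoso": {"allowed-locations", "require-costcenter-tag"},
    "contoso-corp": {"deny-public-ip"},
    "contoso-online": {"require-waf"},
}


def scope_chain(scope: str) -> List[str]:
    """Scopes from the given one up to the intermediate root."""
    chain = []
    current: Optional[str] = scope
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def effective_policies(
    scope: str, exemptions: Optional[Dict[str, Set[str]]] = None
) -> Set[str]:
    """Union of every assignment on the path to the root, minus exemptions.

    A child scope has no way to subtract a parent's assignment; only an
    explicit exemption recorded for a scope on the chain removes one.
    """
    exemptions = exemptions or {}
    chain = scope_chain(scope)
    result: Set[str] = set()
    for s in chain:
        result |= ASSIGNMENTS.get(s, set())
    for s in chain:
        result -= exemptions.get(s, set())
    return result


if __name__ == "__main__":
    print(sorted(effective_policies("sub-payments")))
    print(sorted(effective_policies("sub-payments", {"sub-payments": {"deny-public-ip"}})))
```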

Creating the tree with az and Bicep

Build the skeleton with the CLI:

# Intermediate root under the tenant root, then the major branches
az account management-group create --name contoso --display-name "Contoso"
az account management-group create --name contoso-platform \
  --display-name "Platform" --parent contoso
az account management-group create --name contoso-landingzones \
  --display-name "Landing Zones" --parent contoso
az account management-group create --name contoso-corp \
  --display-name "Corp" --parent contoso-landingzones
az account management-group create --name contoso-online \
  --display-name "Online" --parent contoso-landingzones
az account management-group create --name contoso-sandbox \
  --display-name "Sandbox" --parent contoso

Move an existing subscription under the right MG (this is also what vending automates):

az account management-group subscription add \
  --name contoso-corp \
  --subscription "00000000-0000-0000-0000-000000000000"

Declaratively, the accelerator models the whole tree as code so it is reviewable and reproducible:

// Management groups are tenant-scoped; deploy at 'tenant' scope.
targetScope = 'tenant'

resource intermediate 'Microsoft.Management/managementGroups@2023-04-01' = {
  name: 'contoso'
  properties: { displayName: 'Contoso' }
}

resource platform 'Microsoft.Management/managementGroups@2023-04-01' = {
  name: 'contoso-platform'
  properties: {
    displayName: 'Platform'
    details: { parent: { id: intermediate.id } }
  }
}

resource landingZones 'Microsoft.Management/managementGroups@2023-04-01' = {
  name: 'contoso-landingzones'
  properties: {
    displayName: 'Landing Zones'
    details: { parent: { id: intermediate.id } }
  }
}

The limits that discipline the tree

The hierarchy has hard ceilings, and they are features — they stop you building a tree nobody can reason about. Know them before you design:

| Limit | Value | Why it exists / what it forces |
|---|---|---|
| Management groups per Entra directory | ~10,000 | Plenty; you will use dozens, not thousands |
| MG hierarchy depth (below root) | 6 levels | Forces a flat, comprehensible tree |
| Subscriptions per management group | No hard cap (practical limits apply) | Group freely; governance, not count, drives structure |
| Direct children (MGs + subs) per MG | ~10,000 | Effectively unlimited for real designs |
| MG a subscription can belong to at once | Exactly 1 | A sub has exactly one governance parent |
| Levels of policy/RBAC inheritance | Every level down to the resource | The deeper the tree, the more accumulates |
| Time for a new MG assignment to propagate | Minutes (eventual) | Don’t expect instant enforcement on create |

The depth limit of six is the design constraint that matters most: a tree deeper than three or four working levels (intermediate root → branch → class → maybe one more) is almost always modeling the org chart and should be flattened.
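
A depth check is easy to run against a planned tree before creating anything. The sketch below is Python over a plain parent map; the `tenant-root` key and the helper names are assumptions, and the limit of six is the figure from the table above.

```python
from typing import Dict, List, Optional

MAX_DEPTH_BELOW_ROOT = 6  # figure from the limits table; the root itself is not counted


def depth_below_root(mg: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of hops from a management group up to the tenant root."""
    depth = 0
    current = mg
    while parent[current] is not None:
        depth += 1
        current = parent[current]
    return depth


def too_deep(parent: Dict[str, Optional[str]]) -> List[str]:
    """Management groups that sit deeper than the allowed number of levels."""
    return sorted(
        mg for mg in parent if depth_below_root(mg, parent) > MAX_DEPTH_BELOW_ROOT
    )


# Planned tree as a child -> parent map; the tenant root has no parent.
TREE: Dict[str, Optional[str]] = {
    "tenant-root": None,
    "contoso": "tenant-root",
    "contoso-landingzones": "contoso",
    "contoso-corp": "contoso-landingzones",
}

if __name__ == "__main__":
    print(depth_below_root("contoso-corp", TREE))  # three working levels
    print(too_deep(TREE))                          # nothing exceeds the limit
```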

Platform subscriptions: the shared core

The Platform branch holds the subscriptions the central team owns and workloads never touch. Three is the canonical set; very large estates split further. Each exists to isolate a concern so its blast radius is contained.

Identity subscription

Holds identity infrastructure that workloads depend on but must never co-locate with: domain controllers or Entra Domain Services, identity-sync servers, and any PKI/certificate infrastructure. Locked down hard — no public inbound, strict RBAC, full diagnostic coverage. The reason it is separate: an identity outage or compromise is catastrophic, so it gets the tightest controls and the smallest set of admins.

Management subscription

The observability and automation core: the central Log Analytics workspace every subscription’s diagnostic settings point at, Azure Automation, Azure Monitor alerting, Microsoft Sentinel if you run a SIEM, and the Defender for Cloud configuration. Centralizing the workspace here is what makes “one place to hunt across the whole estate” true. The DeployIfNotExists diagnostic-setting policies (below) all target this workspace.

Connectivity subscription

The network heart: the hub VNet, Azure Firewall, VPN/ExpressRoute gateways, Azure Bastion, DDoS protection plan, and the private DNS zones for private-endpoint resolution. Every spoke peers here; all egress routes through the firewall here. It is the single most operationally sensitive platform subscription because a misconfiguration takes down connectivity for every workload at once.

The three platform subscriptions side by side — what each holds, what it protects against, and the dominant guardrail:

| Platform subscription | Key resources | Isolates / protects | Dominant guardrail |
|---|---|---|---|
| Identity | DC / Entra DS, identity sync, PKI | Identity blast radius | No public inbound; tight RBAC |
| Management | Central Log Analytics, Automation, Sentinel, Defender | Observability continuity | Force diagnostics here; restrict workspace access |
| Connectivity | Hub VNet, Firewall, gateways, Bastion, DDoS, private DNS | Network blast radius | Deny untrusted peering; central egress |

Why workloads never run in platform subscriptions — the rule and its three reasons, as a table you can quote in a design review:

| Reason | What goes wrong if you ignore it |
|---|---|
| Blast radius | A workload bug/incident can now take down identity or connectivity for everyone |
| Cost attribution | Platform spend mixes with workload spend; nobody can chargeback cleanly |
| Access scoping | Workload teams need access to “their” sub; granting it here exposes the shared core |

Hub-and-spoke connectivity

The network is where overlapping-CIDR pain originated and where the landing zone earns its keep. The Connectivity subscription holds the hub; every workload gets a spoke that peers to it.
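
The IP plan itself can be checked mechanically. A minimal sketch using Python's standard `ipaddress` module follows; the `10.16.0.0/12` supernet and the /22 spoke size are example values, not a recommendation.

```python
import ipaddress
from itertools import islice
from typing import List


def overlaps(a: str, b: str) -> bool:
    """True when two CIDR ranges share any address (such VNets cannot be peered)."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


def allocate_spokes(supernet: str, new_prefix: int, count: int) -> List[str]:
    """Carve the first `count` non-overlapping blocks of /new_prefix from the supernet."""
    net = ipaddress.ip_network(supernet)
    blocks = list(islice(net.subnets(new_prefix=new_prefix), count))
    if len(blocks) < count:
        raise ValueError(f"{supernet} cannot hold {count} blocks of /{new_prefix}")
    return [str(block) for block in blocks]


if __name__ == "__main__":
    # Two teams that both accepted the default range collide.
    print(overlaps("10.0.0.0/16", "10.0.0.0/16"))
    # Spokes handed out from one planned supernet never do.
    print(allocate_spokes("10.16.0.0/12", 22, 3))
```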

Anatomy of the hub

The hub VNet (sized generously — a /22 or larger to fit the gateway, firewall, and Bastion subnets) carries the resources every workload shares:

| Hub component | Subnet | Purpose | Note / limit |
|---|---|---|---|
| Azure Firewall | AzureFirewallSubnet (/26 or larger) | Centralized egress inspection + FQDN filtering | Subnet name is fixed; needs a /26 minimum |
| VPN / ExpressRoute gateway | GatewaySubnet (/27 or larger; /26 if both gateway types) | Hybrid connectivity to on-prem | Subnet name fixed; one gateway of each type |
| Azure Bastion | AzureBastionSubnet (/26 or larger) | Browser-based RDP/SSH to spokes | Subnet name fixed; no public IP on VMs needed |
| DDoS protection plan | (VNet-level) | L3/L4 volumetric protection | One plan, shared by all protected VNets |
| Private DNS zones | (no subnet; zone resources) | Resolve private-endpoint FQDNs | Linked to spokes for resolution |
| Azure Route Server (optional) | RouteServerSubnet (/27 or larger) | BGP route exchange with NVAs | Only if you run third-party NVAs |
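
A proposed hub layout can be validated against these minimums before deployment. The following Python sketch uses the standard `ipaddress` module; the `10.0.0.0/22` hub range and the individual subnet ranges are assumed example values, and only the subnet names and minimum sizes come from the table above.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List

# Largest allowed prefix length per fixed-name subnet
# (a smaller prefix length means a larger subnet).
MAX_PREFIXLEN = {
    "AzureFirewallSubnet": 26,
    "GatewaySubnet": 27,
    "AzureBastionSubnet": 26,
    "RouteServerSubnet": 27,
}


def validate_hub(hub_cidr: str, subnets: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the layout passes these checks."""
    hub = ipaddress.ip_network(hub_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(hub):
            problems.append(f"{name} {net} is outside hub {hub}")
        limit = MAX_PREFIXLEN.get(name)
        if limit is not None and net.prefixlen > limit:
            problems.append(f"{name} {net} is smaller than /{limit}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


if __name__ == "__main__":
    layout = {
        "GatewaySubnet": "10.0.0.0/27",
        "AzureFirewallSubnet": "10.0.1.0/26",
        "AzureBastionSubnet": "10.0.2.0/27",  # deliberately too small
    }
    for problem in validate_hub("10.0.0.0/22", layout):
        print(problem)
```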

Spoke peering and forced tunneling

Each spoke VNet peers to the hub with allowForwardedTraffic and (for the spoke→hub link) useRemoteGateways so the spoke uses the hub’s gateway rather than its own. Egress is forced through the firewall with a user-defined route (UDR) sending 0.0.0.0/0 to the firewall’s private IP. The mechanics, peering option by option:

| Peering setting | On which link | Set to | Why |
|---|---|---|---|
| allowVirtualNetworkAccess | Both | true | Lets the peered VNets reach each other |
| allowForwardedTraffic | Hub→spoke (and spoke→hub) | true | Allows traffic that transited the firewall/NVA |
| allowGatewayTransit | Hub→spoke | true | Hub shares its gateway with spokes |
| useRemoteGateways | Spoke→hub | true | Spoke uses the hub’s gateway, not its own |
| UDR 0.0.0.0/0 → firewall | Spoke route table | firewall private IP | Forces all egress through central inspection |
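
A single 0.0.0.0/0 route does not hijack traffic inside the VNet, because Azure picks the route with the longest matching prefix: the more specific routes for the spoke's own range and the peered hub still win. The Python sketch below models only that longest-prefix step with assumed address ranges; it ignores tie-breaking between route sources.

```python
import ipaddress
from typing import List, Tuple

# Simplified effective routes for one spoke subnet. The ranges are example values.
SPOKE_ROUTES: List[Tuple[str, str]] = [
    ("10.20.0.0/22", "VnetLocal"),                # the spoke's own address space
    ("10.0.0.0/22", "VNetPeering"),               # hub address space via peering
    ("0.0.0.0/0", "VirtualAppliance 10.0.1.4"),   # the UDR pointing at the firewall
]


def next_hop(destination: str, routes: List[Tuple[str, str]]) -> str:
    """Pick the next hop of the matching route with the longest prefix."""
    dest = ipaddress.ip_address(destination)
    matching = []
    for prefix, hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matching.append((network.prefixlen, hop))
    if not matching:
        raise LookupError(f"no route to {destination}")
    return max(matching)[1]


if __name__ == "__main__":
    print(next_hop("10.20.1.10", SPOKE_ROUTES))  # stays inside the spoke
    print(next_hop("8.8.8.8", SPOKE_ROUTES))     # internet egress goes to the firewall
    print(next_hop("10.21.0.5", SPOKE_ROUTES))   # another spoke: also via the firewall
```

The last case shows a side effect of the design: spokes are not peered with each other, so spoke-to-spoke traffic matches only the default route and transits the firewall as well.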

Wire a spoke to the hub and force egress through the firewall:

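# Look up both VNets by resource ID: --remote-vnet only accepts a bare name when
# the remote VNet is in the same resource group. Add --subscription to each call
# when hub and spoke live in different subscriptions.
HUB_ID=$(az network vnet show -g rg-connectivity -n vnet-hub --query id -o tsv)
SPOKE_ID=$(az network vnet show -g rg-spoke -n vnet-spoke-app --query id -o tsv)

# Hub -> spoke first (share the gateway), then spoke -> hub (use the hub's gateway)
az network vnet peering create -g rg-connectivity -n hub-to-spoke \
  --vnet-name vnet-hub --remote-vnet "$SPOKE_ID" \
  --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g rg-spoke -n spoke-to-hub \
  --vnet-name vnet-spoke-app --remote-vnet "$HUB_ID" \
  --allow-vnet-access --allow-forwarded-traffic --use-remote-gateways

# Force all spoke egress through the firewall. 10.0.1.4 stands in for the firewall's
# private IP; rt-spoke must already exist and be associated with the spoke subnets.
az network route-table route create -g rg-spoke --route-table-name rt-spoke \
  -n default-to-fw --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance --next-hop-ip-address 10.0.1.4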
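
The spoke side of the same peering in Bicep:

// Assumes the spoke VNet already exists in the target resource group and that the
// hub's resource ID is passed in, since the hub lives in the Connectivity subscription.
param hubVnetId string

resource spokeVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: 'vnet-spoke-app'
}

resource peeringToHub 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-09-01' = {
  parent: spokeVnet
  name: 'spoke-to-hub'
  properties: {
    remoteVirtualNetwork: { id: hubVnetId }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    useRemoteGateways: true     // consume the hub's gateway
  }
}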

Private DNS in the hub

Private endpoints only work if their FQDNs resolve to private IPs, and that resolution must be centralized in the hub or every spoke re-invents it (and drifts). The hub holds the private DNS zones, linked to each spoke VNet, and a DeployIfNotExists policy auto-creates the zone group on new private endpoints. The zones you’ll actually host — each PaaS service has a fixed zone name you cannot rename:

| PaaS target | Private DNS zone name | Resolves |
|---|---|---|
| Blob storage | privatelink.blob.core.windows.net | Storage account blob endpoint |
| Key Vault | privatelink.vaultcore.azure.net | Vault secret/key/cert endpoint |
| Azure SQL Database | privatelink.database.windows.net | SQL server private endpoint |
| App Service / Functions | privatelink.azurewebsites.net | Web app private endpoint |
| Cosmos DB (SQL API) | privatelink.documents.azure.com | Cosmos account endpoint |
| Container Registry | privatelink.azurecr.io | ACR private endpoint |
| Service Bus / Event Hubs | privatelink.servicebus.windows.net | Messaging private endpoint |

The discipline: host these zones once in the connectivity hub, link them to every spoke, and let policy attach them to new private endpoints automatically — so resolution is consistent estate-wide and no spoke runs its own conflicting copy. The deep treatment of private DNS at scale (private resolver vs zones) is in Azure Private Link & Private DNS for PaaS.
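
The fixed zone names lend themselves to a lookup table. The Python sketch below maps a service's public FQDN to the zone from the table above and reports zone-to-spoke links that are missing. The public suffixes are the commonly documented ones and should be verified for your cloud; the spoke names and helper names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Public DNS suffix -> private DNS zone hosted in the hub (zone names from the table).
PRIVATE_ZONE_FOR_SUFFIX = {
    "blob.core.windows.net": "privatelink.blob.core.windows.net",
    "vault.azure.net": "privatelink.vaultcore.azure.net",
    "database.windows.net": "privatelink.database.windows.net",
    "azurewebsites.net": "privatelink.azurewebsites.net",
    "documents.azure.com": "privatelink.documents.azure.com",
    "azurecr.io": "privatelink.azurecr.io",
    "servicebus.windows.net": "privatelink.servicebus.windows.net",
}


def private_zone(fqdn: str) -> str:
    """Return the private DNS zone that must exist for this service FQDN."""
    host = fqdn.lower().rstrip(".")
    for suffix, zone in PRIVATE_ZONE_FOR_SUFFIX.items():
        if host.endswith("." + suffix):
            return zone
    raise KeyError(f"no private DNS zone known for {fqdn}")


def missing_links(
    spokes: Iterable[str], links: Set[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """List (zone, spoke) pairs that lack a virtual-network link."""
    zones = sorted(set(PRIVATE_ZONE_FOR_SUFFIX.values()))
    return [(zone, spoke) for zone in zones for spoke in spokes if (zone, spoke) not in links]


if __name__ == "__main__":
    print(private_zone("contosovault.vault.azure.net"))
    existing = {("privatelink.blob.core.windows.net", "vnet-spoke-app")}
    for zone, spoke in missing_links(["vnet-spoke-app"], existing):
        print(f"link {zone} -> {spoke}")
```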

Hub-and-spoke vs Virtual WAN

The platform team makes this call once. Traditional hub-and-spoke is a VNet you manage (full control, you own peering and routing). Virtual WAN is a Microsoft-managed hub that handles peering, routing, and branch connectivity for you. The trade-off:

| Dimension | Hub-and-spoke (self-managed) | Virtual WAN (Microsoft-managed) |
|---|---|---|
| Hub management | You build/operate the hub VNet | Microsoft manages the hub |
| Routing | You write UDRs and manage transit | Managed routing, automatic transit |
| Branch/site-to-site at scale | Manual per-connection | Built for many branches/VPN at scale |
| Control / customizability | Maximum (your VNet, your rules) | Less; you work within the managed model |
| Global transit (region-to-region) | You build it (peering + routing) | Built-in any-to-any across regions |
| Best for | A handful of regions, high control needs | Many branches, global mesh, less plumbing |
| Cost model | VNet + firewall + gateway you run | Per-hub + per-connection + data |

The decision rule as a table — match your situation to the topology:

| If you have… | Lean toward |
|---|---|
| 1–3 regions, strong networking team, need full control | Hub-and-spoke |
| Many global branches / lots of site-to-site VPN | Virtual WAN |
| Region-to-region any-to-any transit as a baseline need | Virtual WAN |
| Heavy custom routing / third-party NVA chains | Hub-and-spoke (more control) |
| Want least operational plumbing, accept the managed model | Virtual WAN |

The deeper treatment of this exact decision — including Virtual WAN routing intent and secured hubs — is in Hub-and-Spoke vs Virtual WAN: Enterprise Topology; the landing zone simply requires that you make it deliberately and centralize egress either way.

Governance guardrails with Azure Policy

Policy is what turns “we have standards” into “the platform enforces standards.” The landing zone assigns initiatives (bundles of policies) at management-group scopes so every subscription beneath inherits them.

Policy effects — the full spectrum

The effect is the most important field in a policy: it decides what actually happens. Choosing it wrong is the difference between prevention and a useless log entry. The effects a landing zone typically works with, what each does, and the guardrail it suits:

| Effect | What it does | Blocks the request? | Remediates? | Typical guardrail use |
|---|---|---|---|---|
| Deny | Rejects a non-compliant create/update | Yes | No | Forbid public IPs in Corp; forbid disallowed regions |
| Audit | Records non-compliance, allows it | No | No | Visibility-only baseline before you enforce |
| AuditIfNotExists | Audits when a related resource is missing | No | No | “VM has no monitoring agent” (audit the gap) |
| DeployIfNotExists (DINE) | Deploys the missing related resource | No | Yes (via managed identity) | Auto-create diagnostic settings → central workspace |
| Modify | Mutates the request (add/replace properties) | No (alters) | At-create + remediate | Add a required tag; set httpsOnly: true |
| Append | Adds fields to a resource at create | No (alters) | No | Append an IP rule, a setting |
| Manual | Compliance is set by a manual attestation | No | No | Controls you verify out-of-band |
| Disabled | Turns the policy off | No | No | Temporarily silence without unassigning |
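
The table's yes/no columns are enough to place each effect on the watch/block/fix spectrum. A minimal Python sketch, with the flags transcribed from the table and the category labels chosen for illustration:

```python
from typing import Dict, Tuple

# effect: (blocks_request, alters_request, remediates), transcribed from the table
EFFECTS: Dict[str, Tuple[bool, bool, bool]] = {
    "Deny": (True, False, False),
    "Audit": (False, False, False),
    "AuditIfNotExists": (False, False, False),
    "DeployIfNotExists": (False, False, True),
    "Modify": (False, True, True),
    "Append": (False, True, False),
    "Manual": (False, False, False),
    "Disabled": (False, False, False),
}


def classify(effect: str) -> str:
    """Place an effect on the spectrum: 'watch', 'block', 'fix', or 'off'."""
    if effect == "Disabled":
        return "off"
    blocks, alters, remediates = EFFECTS[effect]
    if blocks:
        return "block"
    if alters or remediates:
        return "fix"
    return "watch"


if __name__ == "__main__":
    for name in EFFECTS:
        print(f"{name}: {classify(name)}")
```

A guardrail meant to prevent an incident needs an effect that classifies as "block" here; anything in "watch" only reports after the fact.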

Two of these have a subtlety that bites in production: DeployIfNotExists and Modify both require a managed identity on the assignment, and that identity must hold the right RBAC role (e.g. Contributor on the target) or the remediation silently does nothing — the policy shows non-compliant forever and nobody knows why. This is the single most common landing-zone policy failure; the troubleshooting section walks the fix.

The guardrails every landing zone ships

The accelerator assigns a set of initiatives. The canonical ones, what they enforce, and the effect they use:

| Guardrail | Enforces | Effect | Assigned at |
| --- | --- | --- | --- |
| Allowed locations | Resources only in approved regions (data residency) | Deny | Intermediate root |
| Allowed locations for RGs | Resource groups only in approved regions | Deny | Intermediate root |
| Require a tag (e.g. CostCenter) | Mandatory tags for chargeback | Modify / Deny | Intermediate root |
| Deny classic resources | No legacy ASM resources | Deny | Intermediate root |
| Deploy diagnostic settings | Stream logs to the central workspace | DeployIfNotExists | Management / all |
| Deny public IP on NICs (Corp) | No internet-facing workloads in Corp | Deny | Corp MG |
| Require private endpoints (Corp) | PaaS reached privately only | Deny / Audit | Corp MG |
| Require WAF / DDoS (Online) | Internet-facing apps protected | Audit / Deny | Online MG |
| Allowed VM SKUs | Cost/standardization control | Deny | Landing Zones / sandbox |
| Enforce HTTPS-only on App Service/Storage | No cleartext endpoints | Modify / Deny | Intermediate root |
| Deploy Defender for Cloud | Threat protection everywhere | DeployIfNotExists | Intermediate root |

Assigning an initiative at a management group

Assign the built-in “Allowed locations” guardrail at the intermediate root so the whole org inherits it:

# Resolve the management group's resource ID, then assign the built-in policy at that scope
MG_ID=$(az account management-group show -n contoso --query id -o tsv)
az policy assignment create \
  --name "allowed-locations" \
  --display-name "Allowed locations (India only)" \
  --scope "$MG_ID" \
  --policy "e56962a6-4747-49cd-b67b-bf8b01975c4c" \
  --params '{ "listOfAllowedLocations": { "value": ["centralindia","southindia"] } }'

A DeployIfNotExists assignment needs an identity and a role — this is the part people forget:

# DINE/Modify assignments need a managed identity AND a role, or remediation no-ops
az policy assignment create \
  --name "deploy-diag-to-central-law" \
  --scope "$MG_ID" \
  --policy "<dine-policy-definition-id>" \
  --params '{ "logAnalytics": { "value": "<central-workspace-resource-id>" } }' \
  --mi-system-assigned --location centralindia \
  --role "Contributor" --identity-scope "$MG_ID"

# Then trigger remediation for existing non-compliant resources
az policy remediation create --name "remediate-diag" \
  --policy-assignment "deploy-diag-to-central-law" --management-group contoso
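
The same pattern in Bicep, for an initiative assigned at management-group scope:

// Initiative (policy set) assigned at a management group, with a remediation identity
targetScope = 'managementGroup'
param initiativeName string   // name of the built-in initiative to assign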
resource assignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'lz-guardrails'
  location: 'centralindia'
  identity: { type: 'SystemAssigned' }   // required for DINE/Modify remediation
  properties: {
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policySetDefinitions', initiativeName)
    scope: managementGroup().id
    parameters: {
      listOfAllowedLocations: { value: [ 'centralindia', 'southindia' ] }
    }
  }
}

Policy limits that shape your design

Policy has caps; large initiatives bump into them. Know them before you assemble a 200-policy mega-initiative:

| Policy limit | Value | Design consequence |
| --- | --- | --- |
| Policy definitions per scope (definition location: MG or subscription) | 500 | Reuse built-ins; don’t author needlessly |
| Policy set (initiative) definitions per scope | 200 | Split mega-initiatives into themed sets |
| Policy assignments per scope | 200 | Bundle into initiatives rather than many assignments |
| Policies in a single initiative | ~1,000 | Big initiatives are fine; assignments are the tighter cap |
| Exemptions per scope | 1,000 | Exemptions are cheap; mis-scoped policy is not |
| Parameters per policy definition | 20 | Parameterize, but keep definitions focused |
| Compliance evaluation cadence | ~24h (or on-demand scan) | Don’t expect instant compliance state after a change |
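
A design review can check planned counts against these caps before anyone writes a definition. The sketch below is an assumption-laden helper, not an Azure API: the key names are made up for illustration and the numbers are the per-scope values from the table.

```python
# Caps taken from the limits table; key names are illustrative.
LIMITS = {
    "policy_definitions_per_scope": 500,
    "initiative_definitions_per_scope": 200,
    "assignments_per_scope": 200,
    "policies_per_initiative": 1000,
    "exemptions_per_scope": 1000,
}


def over_limit(planned):
    """Return {name: (planned, limit)} for every planned count above its cap."""
    return {
        name: (count, LIMITS[name])
        for name, count in planned.items()
        if name in LIMITS and count > LIMITS[name]
    }


plan = {"assignments_per_scope": 260, "policies_per_initiative": 200}
print(over_limit(plan))
```

Here the 260 individual assignments breach the cap while a 200-policy initiative does not, which is the table's point: bundle policies into initiatives rather than assigning them one by one.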

The deeper, effect-by-effect treatment — including how auditIfNotExists and remediation tasks actually evaluate — lives in Azure Policy: Governance at Scale; here, policy is the enforcement layer the landing zone wires into the hierarchy.

Identity and access at scale

A landing zone defines who can do what, and where, using RBAC at management-group scope with least privilege, and gates standing privilege behind just-in-time elevation. The model:

| Principle | Implementation | Why |
| --- | --- | --- |
| RBAC at MG scope | Assign roles at the lowest MG that works, not per-subscription | One assignment governs a whole branch; less sprawl |
| Least privilege | Specific built-in roles (e.g. Reader, Network Contributor), not Owner | Limits blast radius of a compromised identity |
| Managed identities over secrets | Workloads use system/user-assigned MIs | No secrets to leak or rotate |
| Just-in-time admin | PIM for privileged roles: activate, don’t hold | No standing global admin; every elevation is logged |
| Custom roles where built-ins don’t fit | Define narrow custom roles | Avoid granting broad roles to cover one gap |
| Break-glass accounts | 2+ excluded emergency accounts | Recover if Conditional Access / PIM locks everyone out |

The role-scope decision as a table — pick the lowest scope that satisfies the need:

| The principal needs to… | Assign role at | Example role |
| --- | --- | --- |
| Read everything for audit/cost across the org | Intermediate root MG | Reader |
| Manage networking across all workloads | Connectivity sub / Landing Zones MG | Network Contributor |
| Build inside their own workload | Their application subscription | Contributor (sub-scoped) |
| Manage policy guardrails | Intermediate root MG | Resource Policy Contributor |
| Operate the central workspace | Management subscription | Log Analytics Contributor |
| Emergency full control | Excluded from CA; PIM-eligible | Owner (break-glass only) |

The identity deep-dives — Conditional Access personas, PIM/PAM architecture, managed-identity federation — are their own track; for the landing zone, the rule is narrow roles at the right scope, no standing privilege, identity in its own platform subscription.

Subscription vending

This is where the landing zone pays off operationally: turning “give my team Azure” from a multi-week project into a templated hand-off. Subscription vending is a module (Bicep/Terraform, or the accelerator’s pipeline) that, given a few inputs (workload name, environment, network size, cost center, target landing zone), provisions a fully-governed subscription.

What a vend actually does, step by step:

| Step | What it provisions | Result |
| --- | --- | --- |
| 1. Create subscription | New subscription under the billing account | A billable, empty subscription exists |
| 2. Place under the correct MG | Move it under Corp / Online / Sandbox | Inherits the right guardrails instantly |
| 3. Apply tags + budget | CostCenter, Environment, an Azure budget | Chargeback + spend alerts wired |
| 4. Create the spoke VNet | A /24 (or sized) from the IP plan | Non-overlapping address space, by design |
| 5. Peer to the hub | Spoke↔hub peering + UDR to firewall | Connected + egress inspected on day one |
| 6. Wire diagnostics | Diagnostic settings → central workspace (often via DINE) | Logging coverage automatic |
| 7. Assign RBAC | The team gets Contributor at their sub scope | They can build; can’t touch the platform |
| 8. Hand off | Output the subscription ID + connection details | Team is productive in hours |

The inputs a vending module typically takes, and the governance each guarantees:

| Input | Example | Governance it enforces |
| --- | --- | --- |
| workloadName | payments | Naming consistency, tagging |
| environment | prod / nonprod | Right MG, right budget, right policy |
| landingZoneType | corp / online | Correct guardrail set inherited |
| networkAddressSpace | 10.42.0.0/24 | Non-overlapping CIDR from the IP plan |
| costCenter | CC-1180 | Chargeback tag, budget owner |
| budgetAmount | ₹150,000/mo | Spend alert at the threshold (an Azure budget notifies; it does not hard-stop spend) |
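
The networkAddressSpace input is only safe if the vend, not the requesting team, picks the range. A minimal sketch of that allocation, using Python's standard `ipaddress` module: the plan range `10.42.0.0/16` and the already-allocated list are hypothetical, and a real vend would read both from wherever the IP plan is recorded.

```python
import ipaddress


def allocate_spoke(plan_cidr, allocated, prefix=24):
    """Return the first free block of the requested size inside the IP plan."""
    plan = ipaddress.ip_network(plan_cidr)
    taken = [ipaddress.ip_network(c) for c in allocated]
    for candidate in plan.subnets(new_prefix=prefix):
        if not any(candidate.overlaps(t) for t in taken):
            return str(candidate)
    raise ValueError("IP plan exhausted for /%d blocks" % prefix)


# Hypothetical plan with two spokes already vended.
print(allocate_spoke("10.42.0.0/16", ["10.42.0.0/24", "10.42.1.0/24"]))
```

Because the check uses `overlaps` rather than string equality, a larger block that was handed out earlier (say a /23) still shields every /24 inside it.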

Vending also enforces a naming convention so the estate stays legible — a resource named vnet-payments-prod-cin tells you the type, workload, environment, and region at a glance. A consistent scheme (component abbreviation, workload, environment, region) is part of what makes governance auditable; bake it into the vend so no team can deviate:

| Resource type | Abbrev | Example name | Pattern |
| --- | --- | --- | --- |
| Subscription | sub | sub-payments-prod | sub-<workload>-<env> |
| Resource group | rg | rg-payments-prod-cin | rg-<workload>-<env>-<region> |
| Virtual network | vnet | vnet-payments-prod-cin | vnet-<workload>-<env>-<region> |
| Subnet | snet | snet-app-payments-prod | snet-<tier>-<workload>-<env> |
| Network security group | nsg | nsg-app-payments-prod | nsg-<tier>-<workload>-<env> |
| Key Vault | kv | kv-payments-prod-cin | kv-<workload>-<env>-<region> (≤24 chars) |
| Log Analytics workspace | law | law-platform-mgmt-cin | law-<scope>-<purpose>-<region> |
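
Baking the scheme into the vend can be as small as a lookup and a format call. The sketch below covers the workload-shaped patterns from the table; the function name `resource_name` is hypothetical, and the only length rule it enforces is the 24-character Key Vault limit noted above.

```python
PATTERNS = {
    "sub": "sub-{workload}-{env}",
    "rg": "rg-{workload}-{env}-{region}",
    "vnet": "vnet-{workload}-{env}-{region}",
    "snet": "snet-{tier}-{workload}-{env}",
    "nsg": "nsg-{tier}-{workload}-{env}",
    "kv": "kv-{workload}-{env}-{region}",
}
MAX_LEN = {"kv": 24}  # Key Vault names are capped at 24 characters


def resource_name(abbrev, **parts):
    """Build a name from the pattern for this resource type; reject over-long names."""
    name = PATTERNS[abbrev].format(**parts).lower()
    limit = MAX_LEN.get(abbrev)
    if limit is not None and len(name) > limit:
        raise ValueError(f"{name!r} exceeds {limit} characters")
    return name


print(resource_name("vnet", workload="payments", env="prod", region="cin"))
print(resource_name("snet", tier="app", workload="payments", env="prod"))
```

Failing loudly on an over-long Key Vault name at vend time is deliberate: a silent truncation would produce a name nobody can derive from the pattern.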

A minimal vend in az (the accelerator does far more, but this is the shape):

# 1) Create the subscription under a billing account (alias API)
az account alias create --name "sub-payments-prod" \
  --billing-scope "/providers/Microsoft.Billing/billingAccounts/<acct>/billingProfiles/<profile>/invoiceSections/<section>" \
  --display-name "Payments Prod" --workload Production

# 2) Place it under the Corp landing-zone MG so it inherits the guardrails
SUB_ID=$(az account alias show --name "sub-payments-prod" --query properties.subscriptionId -o tsv)
az account management-group subscription add --name contoso-corp --subscription "$SUB_ID"

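# 3) Tag + budget (a budget needs start and end dates; the values here are placeholders)
az tag create --resource-id "/subscriptions/$SUB_ID" \
  --tags CostCenter=CC-1180 Environment=prod
az consumption budget create --budget-name "payments-prod" --amount 150000 \
  --time-grain monthly --category cost \
  --start-date "<YYYY-MM-01>" --end-date "<YYYY-MM-DD>" --subscription "$SUB_ID"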

The mature path is the Azure Landing Zone accelerator (ALZ Bicep / Terraform modules), which encodes the entire tree, the policy initiatives, the connectivity hub, and the vending module as reviewable infrastructure-as-code — see Infrastructure as Code 101: Your First Terraform on Azure for the IaC fundamentals that make this maintainable.

Architecture at a glance

The diagram traces the landing zone the way governance and traffic actually flow through it — top-down for control, left-to-right for the request path. Read it as four zones. On the left, the governance spine: the Tenant Root and the intermediate management group where org-wide guardrails (allowed regions, required tags, deny-classic) are assigned and from which Azure Policy and RBAC inherit downward into everything else — this is the control plane, and the numbered badge there marks the failure that bites hardest, a mis-scoped Deny that blocks deployments estate-wide. Next, the Platform zone holds the three workload-free subscriptions — Identity (Entra/DC infra), Management (the central Log Analytics workspace every diagnostic setting targets), and Connectivity (the hub) — owned by the central team. The third zone is the Connectivity hub itself: Azure Firewall for centralized egress inspection, the VPN/ExpressRoute gateway for hybrid, and private DNS for private-endpoint resolution; a badge here marks the DeployIfNotExists remediation-identity failure (logs silently never flow) and another marks forced-tunneling/peering mistakes.

On the right sit the Landing Zones — the Corp spoke (private, Deny public, peered to the hub and routing egress through the firewall) and the Online spoke (internet-facing, behind WAF/DDoS) — each a vended application subscription with its own non-overlapping spoke VNet. Follow the arrows: governance inherits down from the intermediate MG into Platform and Landing Zones; workload egress flows out from each spoke through the hub firewall; diagnostics flow back from every subscription into the central workspace in Management. The whole method is in that picture — decide the guardrails once at the top, vend governed subscriptions into Corp or Online, peer them to the hub, and let policy and logging apply themselves. The badges and the legend beneath the diagram narrate the four failures that turn a clean landing zone into an incident, with the confirm-and-fix for each.

Azure enterprise-scale landing zone architecture: a governance spine of the Tenant Root and an intermediate management group assigning org-wide Azure Policy and RBAC that inherit downward; a Platform branch with workload-free Identity, Management (central Log Analytics), and Connectivity subscriptions; a connectivity hub holding Azure Firewall for centralized egress, a VPN/ExpressRoute gateway for hybrid, and private DNS; and Landing Zones with a Corp spoke (private, deny-public, peered to the hub) and an Online spoke (internet-facing behind WAF and DDoS) — governance inheriting down, workload egress flowing out through the firewall, and diagnostics flowing back to the central workspace, with numbered badges marking a mis-scoped Deny that blocks estate-wide deployments, a DeployIfNotExists remediation-identity failure, forced-tunneling peering mistakes, and overlapping-CIDR spokes that cannot peer

Real-world scenario

Meridian Freight is the global logistics company from the opening — 2,400 employees, operations across India, the EU, and North America, and the forty-one-subscription sprawl that took two quarters to partially untangle. Their Azure spend was about ₹2.1 crore/month across those subscriptions, with no central cost view. The brief from the new Head of Cloud was blunt: “Stop the bleeding, then build a foundation we can grow on for five years.” Here is how the landing zone went in, what nearly went wrong, and how it was resolved.

The starting mess. Three of the forty-one subscriptions had production VNets on 10.0.0.0/16; two of those needed to talk to each other for a new track-and-trace integration, and could not peer. Logs landed in four different Log Analytics workspaces and three teams had none at all, so a credential-stuffing incident the prior year had taken eleven days to scope. Seven subscriptions had public-facing SQL databases that nobody had sanctioned. The security team learned of new internet-facing apps from external scans.

The design. The platform team (deliberately kept small at five engineers, but empowered) adopted the Azure Landing Zone accelerator on Terraform. They built the management-group tree: an intermediate root meridian, a Platform branch with Identity, Management, and Connectivity subscriptions, and a Landing Zones branch split into Corp (their warehouse, ERP, and on-prem-connected workloads) and Online (the customer tracking portal and partner APIs). They stood up a hub-and-spoke in Connectivity, choosing it over Virtual WAN because they had a strong networking team and only three regions, and wanted full control over the firewall rule base. A central IP plan carved 10.100.0.0/14 into per-spoke /24s (room for 1,024 spokes) so no future workload could collide again.

The guardrails. At the intermediate root: Allowed locations (Deny, India/EU/US regions only — a data-residency requirement for EU freight data), required CostCenter tag (Modify), deny classic resources, and deploy diagnostic settings to the central workspace (DeployIfNotExists). On Corp: deny public IP on NICs — which would have made all seven rogue public SQL databases impossible. On Online: require WAF and DDoS for anything internet-facing.

What nearly went wrong. Two weeks in, the platform team rolled the Deny public-IP policy to Corp and also, by mistake, scoped a draft “deny all public IPs” assignment at the intermediate root instead of Corp. Within an hour, three application teams reported that every deployment was failing — including legitimate Online workloads that needed public IPs, and even the Connectivity team’s own gateway deployment. The blast radius was the whole estate, exactly because policy inherits down from the root. The on-call platform engineer’s first instinct was to delete the policy definition; the right move was faster and surgical: identify the over-scoped assignment (not the definition), and remove it at the root, leaving the correctly-scoped Corp assignment intact.

The diagnosis. They confirmed it in two commands. az policy assignment list --scope <intermediate-root-MG> -o table showed the rogue deny-all-public-ip assignment at the root. az policy state list --filter "complianceState eq 'NonCompliant'" showed a flood of denied deployments tied to that assignment’s definition. The fix was to delete the assignment at the root (az policy assignment delete --name deny-all-public-ip --scope <root-MG>) — keeping the intended Corp-scoped one — and, for the two legitimate Online deployments that had been blocked mid-flight, nothing more was needed once the root assignment was gone. Deployments recovered within the policy-propagation window (a few minutes).

The second near-miss. The DeployIfNotExists diagnostic-settings policy showed every resource as non-compliant a day after assignment, and no logs were flowing to the central workspace. The cause was the classic one: the assignment’s managed identity had no role on the target scope, so remediation silently no-oped. az role assignment list --assignee <assignment-principal-id> returned empty. They granted the identity Contributor at the intermediate root and ran az policy remediation create; within the hour the central workspace was ingesting from all subscriptions.

The outcome. Within ten weeks, all forty-one legacy subscriptions were either re-parented under the new tree or scheduled into Decommissioned. New workloads were vended — the partner-API team went from request to a governed, hub-connected, logging-wired subscription in under four hours, versus the three weeks the last team had waited. Central cost visibility (one view across the estate) surfaced ₹34 lakh/month of idle and orphaned resources, which FinOps then reclaimed. The credential-stuffing-class incident that took eleven days to scope would now be a single KQL query against one workspace. The lesson the Head of Cloud put on the wall: “A landing zone is not a network diagram. It is the decision about what gets decided once.”

The incident-and-build as a timeline, because the order of moves is the lesson:

| Week / moment | Action | Effect | What it should have been |
| --- | --- | --- | --- |
| W0 | Adopt ALZ accelerator (Terraform) | Tree + policy as reviewable code | |
| W1 | Build MG tree, platform subs, hub | Foundation exists | |
| W2 | Roll deny-public-IP… scoped at root by mistake | Every deployment estate-wide fails | Scope it at Corp, not root |
| W2 +1h | First instinct: delete the policy definition | Would orphan the correct Corp assignment too | Delete the over-scoped assignment |
| W2 +90m | az policy assignment list finds rogue root assignment | Root cause localized | The right diagnostic |
| W2 +2h | Delete root assignment; keep Corp one | Deployments recover in minutes | Correct fix |
| W3 | DINE diagnostics shows all non-compliant, no logs | Remediation silently no-oped | Identity needs a role |
| W3 +1h | Grant MI Contributor, run remediation | Central workspace ingests all subs | The fix nobody documents |
| W4–W10 | Re-parent/decommission 41 legacy subs; vend new | 4-hour onboarding; ₹34L/mo reclaimed | The payoff |

Advantages and disadvantages

The landing-zone model both enables scale and imposes discipline. Weigh it honestly before committing an organization to it:

| Advantages (why it pays off) | Disadvantages (why it bites) |
| --- | --- |
| Every workload starts from the same secure, connected, logged baseline; no team rebuilds the basics | Heavy up-front design; getting the MG tree or IP plan wrong is expensive to undo (re-parenting, re-IP-ing) |
| Governance applies automatically to hundreds of subscriptions via policy inheritance: enforce, don’t audit | Policy inherits down: a mis-scoped Deny at a high MG can break deployments across the entire estate |
| Onboarding drops from weeks to hours via subscription vending | A too-small platform team becomes a ticket queue, and the foundation that was meant to unblock teams now blocks them |
| Central logging + Defender give one place to hunt across the whole estate | Centralization concentrates blast radius: a Connectivity or policy mistake hits everyone at once |
| Non-overlapping IP plan + hub peering means workloads can always interconnect | Rigid guardrails can block legitimate innovation if exemptions aren’t easy and fast |
| Cost is attributable and budgeted (tags, budgets with alerts per vended sub) | The reference architecture is a blueprint, not the answer; copying it without adapting causes mismatch |
| Identity isolated in its own platform subscription limits identity blast radius | Operational maturity (IaC, PIM, change control) is a prerequisite; bolt it onto an immature org and it stalls |

The model is right for organizations past the five-to-ten-subscription mark, regulated estates, and multi-year journeys where the foundation must outlive the first projects. It is wrong for a single-team startup on one subscription — there the overhead is pure cost. The disadvantages are all manageable: scope Deny policies at the narrowest MG that works, staff the platform team to demand, make exemptions a fast self-service path, and treat the reference architecture as a starting point you adapt. The failure mode is always the same — applying the full enterprise pattern to an organization that needed a “minimum viable landing zone,” or under-staffing the team that operates it.

Hands-on lab

Build a minimal but real landing-zone skeleton — a management-group tree, an inherited Deny guardrail, and a proof that inheritance works — all free (management groups and policy cost nothing). Run in Cloud Shell (Bash). You need permission to manage management groups at the tenant root (the Management Group Contributor role or higher); if you lack it, do this in a test tenant.

Step 1 — Create a small management-group tree.

az account management-group create --name lab-root --display-name "Lab Root"
az account management-group create --name lab-platform \
  --display-name "Lab Platform" --parent lab-root
az account management-group create --name lab-corp \
  --display-name "Lab Corp" --parent lab-root
az account management-group show --name lab-root --expand --query \
  "{name:displayName, children:children[].displayName}" -o json

Expected: Lab Root with children Lab Platform and Lab Corp (the query returns display names).

Step 2 — Assign an “Allowed locations for resource groups” Deny guardrail at the root. Everything beneath inherits it. (The plain “Allowed locations” built-in does not evaluate resource groups, so the resource-group variant is the one Step 3 can exercise with a single command.)

ROOT_ID=$(az account management-group show -n lab-root --query id -o tsv)
az policy assignment create \
  --name "lab-allowed-locations" \
  --display-name "Lab: allowed locations for resource groups (India only)" \
  --scope "$ROOT_ID" \
  --policy "e765b5de-1225-4ba3-bd56-1ac6695af988" \
  --params '{ "listOfAllowedLocations": { "value": ["centralindia","southindia"] } }'

Expected: an assignment object returns with scope set to the lab-root MG.

Step 3 — Prove inheritance blocks a non-compliant deployment. Move a test subscription under lab-corp, then try to create a resource group in a disallowed region — the inherited root policy should deny it.

# Place a test/sandbox subscription under lab-corp (inherits the root deny)
az account management-group subscription add --name lab-corp \
  --subscription "<your-test-subscription-id>"
az account set --subscription "<your-test-subscription-id>"

# This SHOULD FAIL — westeurope is not in the allowed list inherited from the root
az group create -n rg-policy-test -l westeurope

Expected: a RequestDisallowedByPolicy error naming lab-allowed-locations. That error is the landing zone working — a guardrail assigned two levels up blocked a non-compliant deployment.

Step 4 — Confirm a compliant deployment succeeds.

az group create -n rg-policy-test -l centralindia   # allowed → succeeds

Expected: the resource group is created — same policy, compliant region, no block.

Validation checklist. You created a governance hierarchy, assigned a Deny guardrail at the top, and proved it inherits down to a subscription two levels below — blocking a disallowed region while permitting an allowed one. That is the entire landing-zone mechanism in four steps, no networking required. What each step proves:

| Step | What you did | What it proves | Real-world analogue |
| --- | --- | --- | --- |
| 1 | Built an MG tree | The governance spine exists | The intermediate-root + branches design |
| 2 | Assigned Deny at the root | Guardrails live at the top scope | Org-wide allowed-regions policy |
| 3 | Disallowed-region create failed | Policy inherits down and blocks | A real data-residency guardrail |
| 4 | Allowed-region create succeeded | Guardrails permit compliant work | Teams move fast inside the rails |

Cleanup. Remove the assignment, move the subscription back, and delete the MGs (an MG must be empty of children/subscriptions to delete).

az policy assignment delete --name "lab-allowed-locations" --scope "$ROOT_ID"
az group delete -n rg-policy-test --yes --no-wait
# Move the sub back under the tenant root (its MG name is the tenant ID), then delete the lab MGs (leaf-first)
TENANT_ROOT=$(az account show --query tenantId -o tsv)
az account management-group subscription add --name "$TENANT_ROOT" --subscription "<your-test-subscription-id>"
az account management-group delete --name lab-corp
az account management-group delete --name lab-platform
az account management-group delete --name lab-root

Cost note. Management groups, policy assignments, and RBAC are free — this lab costs nothing. (The resource group you created is empty and also free; deleting it is just tidiness.)

Common mistakes & troubleshooting

The landing zone fails in a small number of well-known ways, almost all rooted in inheritance and remediation identities. First the playbook as a scannable table you can read mid-incident, then the detail for the ones that bite hardest.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Every deployment estate-wide suddenly fails with RequestDisallowedByPolicy | A Deny policy assigned too high (root/intermediate MG) | az policy assignment list --scope <root-MG> -o table; the error names the assignment | Delete/re-scope the assignment (not the definition) to the narrow MG |
| 2 | DeployIfNotExists/Modify policy shows everything non-compliant; nothing remediates | Assignment’s managed identity has no role on the target | az role assignment list --assignee <assignment-principalId> is empty | Grant the MI the required role (e.g. Contributor) at the scope; run az policy remediation create |
| 3 | Logs not flowing to the central workspace despite a diagnostics policy | DINE never remediated existing resources (only new ones at create) | az policy state list --filter "complianceState eq 'NonCompliant'" | Trigger a remediation task for existing resources |
| 4 | Two workloads can’t peer / VPN routes clash | Overlapping spoke CIDRs (no IP plan) | az network vnet show --query addressSpace on both | Re-IP one spoke from the planned non-overlapping range |
| 5 | A subscription doesn’t get the expected guardrails | It’s parented under the wrong MG (or still at tenant root) | az account management-group subscription show-sub-under-mg --name <MG>, or az account management-group show --name <MG> --expand, and check where the subscription appears | Move it under the correct MG (vending does this) |
| 6 | A legitimate resource is blocked by a guardrail and the team is stuck | No fast exemption path; policy too rigid | The deny error names the policy | Create a scoped, time-boxed policy exemption |
| 7 | Spoke egress bypasses the firewall (uninspected internet) | Missing/incorrect UDR 0.0.0.0/0 → firewall | az network route-table route list; effective routes on the NIC | Add the UDR to the firewall private IP; check effective routes |
| 8 | RBAC grants far more than intended | Role assigned at a high MG scope, inherited everywhere | az role assignment list --scope <MG> --include-inherited | Re-assign at the lowest scope that works; remove the broad one |
| 9 | Platform team is a bottleneck; teams wait weeks | Manual onboarding; no vending | (process observation) | Implement subscription vending (accelerator module) |
| 10 | Policy change “didn’t take” / old state lingers | Compliance evaluation is eventual (~24h) | az policy state list shows stale; trigger on-demand scan | Wait for propagation or trigger az policy state trigger-scan |
| 11 | Can’t delete an MG | It still has child MGs or subscriptions | az account management-group show --expand lists children | Move children/subs out first, then delete leaf-first |
| 12 | Exemption isn’t relaxing the policy | Exemption scoped wrong, or it’s a Modify/DINE (exemptions don’t “undo” deployed state) | az policy exemption list --scope <scope> | Scope the exemption to the exact resource/MG; for DINE, fix the resource directly |

The expanded form for the failures that cause the most damage:

1. Every deployment estate-wide suddenly fails with RequestDisallowedByPolicy. Root cause: A Deny policy (often “deny public IP” or “allowed locations” with too narrow a list) was assigned at the intermediate root or tenant root instead of the specific landing-zone MG. Since policy inherits down, it now blocks legitimate deployments across every subscription beneath, including platform subscriptions. Confirm: The deployment error names the assignment. List assignments at the high scope: az policy assignment list --scope $(az account management-group show -n contoso --query id -o tsv) -o table. A flood of denials in az policy state list --filter "complianceState eq 'NonCompliant'" corroborates. Fix: Delete or re-scope the assignment, not the policy definition (deleting the definition is slower and can orphan correct assignments elsewhere): az policy assignment delete --name <assignment> --scope <high-MG>, then re-create it at the narrow MG (e.g. Corp). Deployments recover within the propagation window (minutes). This is the most alarming landing-zone incident, because one assignment stops every team at once.

2. A DeployIfNotExists or Modify policy reports everything non-compliant and never remediates. Root cause: DINE/Modify assignments run remediation as a managed identity, and that identity needs an RBAC role on the target scope (e.g. Contributor to deploy a diagnostic setting). If the assignment was created without --mi-system-assigned and a role, or the role grant was missed, remediation silently no-ops and compliance shows red forever with no error anywhere obvious. Confirm: az role assignment list --assignee <assignment-principalId> -o table returns empty (find the principal via az policy assignment show --name <a> --scope <MG-id> --query identity.principalId). Fix: Ensure the assignment has an identity and grant it the role at the scope, then trigger remediation:

az role assignment create --assignee <principalId> --role "Contributor" \
  --scope $(az account management-group show -n contoso --query id -o tsv)
az policy remediation create --name fix --policy-assignment <assignment> --management-group contoso

4. Two workloads can’t peer or their VPN routes clash. Root cause: Spokes were given overlapping CIDRs because there was no central IP plan — the original sprawl problem, recreated. Overlapping VNets cannot peer, and overlapping on-prem routes break hybrid routing. Confirm: az network vnet show -g <rg> -n <vnet> --query addressSpace.addressPrefixes on both shows colliding ranges. Fix: Re-IP one spoke from the planned non-overlapping range (the painful, production-affecting fix that the IP plan exists to prevent). Vending must allocate CIDRs from a central plan so this can never recur.
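
Checking for collisions by eye across dozens of VNets does not scale. A short sketch of the comparison with Python's `ipaddress` module follows; the VNet names and prefixes are hypothetical, and in practice the input would be assembled from the addressSpace.addressPrefixes values the command above returns.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vnets):
    """vnets maps a VNet name to its list of CIDR prefixes; returns clashing pairs."""
    clashes = []
    for (a, a_prefixes), (b, b_prefixes) in combinations(sorted(vnets.items()), 2):
        for pa in a_prefixes:
            for pb in b_prefixes:
                if ipaddress.ip_network(pa).overlaps(ipaddress.ip_network(pb)):
                    clashes.append((a, b, pa, pb))
    return clashes


estate = {
    "vnet-erp": ["10.0.0.0/16"],
    "vnet-track": ["10.0.0.0/16"],
    "vnet-payments": ["10.42.0.0/24"],
}
print(find_overlaps(estate))
```

Run as a pipeline gate on the IP plan, a non-empty result is a reason to fail the vend rather than something to discover when peering is attempted.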

7. Spoke egress bypasses the firewall. Root cause: The spoke is missing the UDR that forces 0.0.0.0/0 to the firewall’s private IP (or the route table isn’t associated with the subnet), so traffic egresses directly to the internet, uninspected — a security and compliance gap that audits flag. Confirm: Check effective routes on a NIC in the spoke: az network nic show-effective-route-table -g <rg> -n <nic> -o table — the default route should point at the firewall (VirtualAppliance), not Internet. Fix: Create the UDR to the firewall private IP and associate the route table with the spoke’s subnets; re-check effective routes.
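
The same check can be automated over the effective-routes output. This sketch assumes each route is a dict with `state`, `addressPrefix`, `nextHopType`, and `nextHopIpAddress` keys, which is how the JSON form of that command is understood to look; verify the field names against your own output before relying on it. The firewall IP is a placeholder.

```python
def default_route_inspected(routes, firewall_ip):
    """True only if the active 0.0.0.0/0 route points at the firewall appliance."""
    for route in routes:
        if route.get("state") != "Active":
            continue  # overridden system routes are reported as Invalid
        if "0.0.0.0/0" in route.get("addressPrefix", []):
            return (
                route.get("nextHopType") == "VirtualAppliance"
                and firewall_ip in route.get("nextHopIpAddress", [])
            )
    return False


# Hypothetical effective routes for a NIC in a correctly configured spoke.
good = [
    {"state": "Invalid", "addressPrefix": ["0.0.0.0/0"], "nextHopType": "Internet", "nextHopIpAddress": []},
    {"state": "Active", "addressPrefix": ["0.0.0.0/0"], "nextHopType": "VirtualAppliance", "nextHopIpAddress": ["10.100.0.4"]},
]
# A spoke with no UDR: the system default route is still active.
bad = [
    {"state": "Active", "addressPrefix": ["0.0.0.0/0"], "nextHopType": "Internet", "nextHopIpAddress": []},
]
print(default_route_inspected(good, "10.100.0.4"), default_route_inspected(bad, "10.100.0.4"))
```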

Best practices

Security notes

The security guardrails that also enforce the architecture — where secure and well-governed pull in the same direction:

| Control | Mechanism | Secures against | Also enforces |
| --- | --- | --- | --- |
| Deny public IP (Corp) | Deny policy at Corp MG | Unsanctioned internet exposure | The private-workload class boundary |
| Require private endpoints | Deny/Audit policy | PaaS reached over the public internet | Hub private-DNS resolution discipline |
| Force egress via firewall | UDR 0.0.0.0/0 → firewall | Uninspected/exfiltration egress | Centralized inspection model |
| Central diagnostics | DeployIfNotExists to one workspace | Blind spots in detection | Complete observability coverage |
| PIM for privileged roles | JIT elevation | Standing admin compromise | No-standing-privilege principle |
| Defender everywhere | DINE enabling Defender | Untriaged threats per-team | Org-wide secure-score baseline |

Cost & sizing

A landing zone’s governance layer is nearly free; the cost is in the shared infrastructure it stands up and the discipline it brings to workload spend. The drivers:

A rough monthly picture for the shared platform of a mid-size estate, and what each line buys:

Cost line | What you pay for | Rough INR / month | What it delivers | How to right-size
Management groups / Policy / RBAC | The entire governance layer | ₹0 | The whole control plane | Nothing to size; it’s free
Azure Firewall (Standard) | Centralized egress inspection | ~₹65,000–95,000 + per-GB | Inspected, FQDN-filtered egress | Firewall Basic for small estates
VPN gateway | Hybrid connectivity | ~₹15,000–40,000 | On-prem reachability | Right SKU to throughput; skip if cloud-only
ExpressRoute gateway + circuit | Private hybrid at scale | Circuit-dependent | High-bandwidth private hybrid | Only if you need private/high-bandwidth
Azure Bastion | Jump-box-free admin access | ~₹13,000–20,000 | Secure RDP/SSH, no public VM IPs | Basic SKU for low concurrency
DDoS Network Protection (plan) | L3/L4 volumetric defense | ~₹2.4 lakh (flat) | Protects all VNets in the plan | Amortize across many VNets; or per-IP SKU
Central Log Analytics | Estate-wide telemetry | Ingestion-dependent | One place to hunt + Defender data | Basic Logs, table retention, daily cap

The honest floor: governance itself is free, so a minimum viable landing zone (the MG tree + core policy + central logging, without a firewall/DDoS hub) costs almost nothing and is the right starting point for a small org — add the connectivity hub’s cost only when real workloads need centralized egress and hybrid.
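
To make the fixed lines concrete, the arithmetic below sums only the rows that quote a rupee range (Firewall, VPN gateway, Bastion, DDoS plan), reading ₹2.4 lakh as ₹240,000. The per-GB, circuit-dependent, and ingestion-dependent lines are left out, so treat the result as a rough floor on the hub’s fixed cost under the table’s own estimates, not a quote.

```python
# Fixed monthly ranges (INR) copied from the cost table; variable lines omitted.
FIXED_LINES = {
    "Azure Firewall (Standard)": (65_000, 95_000),
    "VPN gateway": (15_000, 40_000),
    "Azure Bastion": (13_000, 20_000),
    "DDoS Network Protection (plan)": (240_000, 240_000),  # 2.4 lakh, flat
}

def total_range(lines):
    """Sum the low and high ends of each (low, high) range."""
    low = sum(lo for lo, _ in lines.values())
    high = sum(hi for _, hi in lines.values())
    return low, high

full_hub = total_range(FIXED_LINES)
no_ddos = total_range(
    {name: rng for name, rng in FIXED_LINES.items() if not name.startswith("DDoS")}
)

print(full_hub)  # (333000, 395000)
print(no_ddos)   # (93000, 155000)
```

On these figures the flat DDoS plan is well over half of the fixed total, which is why the table suggests amortizing it across many VNets or using the per-IP SKU.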

Interview & exam questions

1. What is an enterprise-scale landing zone, and how is it more than a network topology? It is Microsoft’s prescriptive Cloud Adoption Framework architecture for running Azure at scale — a complete operating model expressed as Azure resources: a management-group hierarchy for governance inheritance, platform vs application subscriptions, hub-and-spoke (or Virtual WAN) connectivity, Azure Policy guardrails, centralized identity and logging, and subscription vending. It is not just a hub VNet; the network is one of five pillars (resource organization, governance, network, identity, operations).

2. Why align management groups to governance rather than the org chart? Org charts re-organize frequently; governance requirements (Corp vs Online, regulated vs not) are stable. Modeling the org chart (Marketing MG, EMEA MG) means constant re-parenting and policy that doesn’t match real control needs. Modeling by governance — what must be governed differently — means the tree and its inherited guardrails stay correct through re-orgs.

3. Explain how Azure Policy and RBAC inheritance works in the hierarchy, and the risk it creates. Both inherit downward: an assignment at a management group applies to every child MG, subscription, resource group, and resource beneath it. The power is that one assignment governs a whole branch; the risk is that a too-strict Deny at a high scope silently blocks deployments across the entire estate. Policy is additive and Deny is sticky — a child can be stricter but never looser; the only relaxation is a scoped exemption.

4. What are the platform subscriptions and why don’t workloads run in them? Identity (domain/Entra DS, PKI), Management (central Log Analytics, automation, Sentinel/Defender), and Connectivity (hub VNet, firewall, gateways, DNS). Workloads are excluded to contain blast radius (a workload incident can’t take down identity/connectivity), keep cost attribution clean, and avoid granting workload teams access to the shared core.

5. When would you choose Virtual WAN over traditional hub-and-spoke? Choose Virtual WAN when you have many global branches / heavy site-to-site VPN, need region-to-region any-to-any transit as a baseline, and want minimal routing plumbing (Microsoft manages the hub). Choose hub-and-spoke when you have a few regions, a strong networking team, need maximum control over routing and the firewall rule base, or run complex third-party NVA chains.

6. What’s the difference between Deny, Audit, and DeployIfNotExists policy effects? Audit records non-compliance but allows the action (visibility only). Deny rejects the non-compliant create/update at the control plane (prevention). DeployIfNotExists remediates by deploying a missing related resource (e.g. a diagnostic setting) using a managed identity — it doesn’t block, it fixes. The classic pitfall: DINE (and Modify) need the assignment’s managed identity to hold an RBAC role on the target, or remediation silently no-ops.

7. A DeployIfNotExists policy shows everything non-compliant and nothing is remediating. What’s wrong and how do you confirm? The assignment’s managed identity lacks the required RBAC role on the target scope, so remediation can’t deploy anything. Confirm with az role assignment list --assignee <assignment-principalId> returning empty. Fix by granting the identity the role (e.g. Contributor) at the scope and running az policy remediation create to remediate existing resources (DINE only deploys automatically when a resource is created or updated; resources that already exist need a remediation task).

8. Every deployment across the estate suddenly fails with RequestDisallowedByPolicy. What happened and what’s the fix? A Deny policy (e.g. deny-public-IP or a too-narrow allowed-locations list) was assigned at too high a scope (intermediate/tenant root) and, because policy inherits down, now blocks legitimate deployments everywhere. The error names the assignment. Fix by deleting/re-scoping the assignment (not the definition) to the narrow MG (e.g. Corp); deployments recover within the propagation window.

9. What is subscription vending and what does it provision? A templated process (Bicep/Terraform/accelerator) that creates a subscription and, critically, places it under the correct management group so it inherits the right guardrails instantly, then peers its spoke to the hub, allocates a non-overlapping CIDR from the IP plan, applies tags and a budget, wires diagnostic settings to the central workspace, and grants the team Contributor at their subscription scope. It turns onboarding from weeks to hours.

10. How does a landing zone prevent the overlapping-CIDR problem? With a central IP plan: a documented, non-overlapping address space (e.g. a /14 carved into per-spoke /24s) from which vending allocates every spoke. Because no two spokes ever share address space, they can always peer to the hub and to each other, and hybrid routes never clash — eliminating the re-IP-ing catastrophe that ungoverned, team-chosen 10.0.0.0/16 ranges cause.

11. What are the key limits on the management-group hierarchy? Up to ~10,000 MGs per directory, a maximum depth of 6 levels below the tenant root, a subscription belongs to exactly one MG at a time, and policy/RBAC inherit at every level down to the resource. The six-level depth cap is the design discipline: a deeper tree is almost always an org chart in disguise and should be flattened.

12. How do you keep guardrails from blocking legitimate innovation? Scope Deny policies at the narrowest MG that works (not the root), and make policy exemptions a fast, self-service, scoped, time-boxed path so a team blocked by a guardrail on a legitimate resource is unblocked in minutes — without weakening the baseline for everyone. Rigid guardrails only harm when exemptions are slow.
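
The central IP plan from question 10 can be illustrated with Python’s standard ipaddress module. This is a sketch, not a vending implementation: the plan range 10.64.0.0/14, the class name IpPlan, and the spoke names are hypothetical, and a real process would persist allocations somewhere durable rather than in memory.

```python
import ipaddress

class IpPlan:
    """Hands out non-overlapping spoke ranges from one supernet, in order."""

    def __init__(self, supernet: str, spoke_prefix: int = 24):
        # subnets() yields every /24 inside the supernet exactly once.
        self._free = ipaddress.ip_network(supernet).subnets(new_prefix=spoke_prefix)
        self.allocated = {}  # spoke name -> network

    def allocate(self, spoke: str) -> ipaddress.IPv4Network:
        if spoke in self.allocated:
            return self.allocated[spoke]  # idempotent if vending is re-run
        net = next(self._free)  # raises StopIteration when the plan is exhausted
        self.allocated[spoke] = net
        return net

plan = IpPlan("10.64.0.0/14")
a = plan.allocate("spoke-payments")
b = plan.allocate("spoke-tracking")
print(a, b, a.overlaps(b))  # 10.64.0.0/24 10.64.1.0/24 False

# The ungoverned case: two teams each pick 10.0.0.0/16, which can never peer.
team1 = ipaddress.ip_network("10.0.0.0/16")
team2 = ipaddress.ip_network("10.0.0.0/16")
print(team1.overlaps(team2))  # True
```

A /14 holds 1,024 /24 spokes, so the plan also makes exhaustion visible in advance instead of surfacing as a collision.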

These questions map to AZ-305 (Designing Microsoft Azure Infrastructure Solutions): design governance, design identity and access, and design network solutions; and to AZ-104 (Administrator): manage Azure identities and governance (management groups, Azure Policy, RBAC). The connectivity content touches AZ-700. A compact cert mapping for revision:

Question theme | Primary cert | Exam objective area
MG hierarchy, governance design | AZ-305 | Design governance
Policy effects, initiatives, exemptions | AZ-104 / AZ-305 | Manage governance; design governance
Platform vs landing-zone subscriptions | AZ-305 | Design infrastructure / governance
Hub-and-spoke vs Virtual WAN | AZ-305 / AZ-700 | Design network solutions
RBAC scope, PIM, least privilege | AZ-305 / AZ-500 | Design identity and access
Subscription vending, IP planning | AZ-305 | Design infrastructure

Quick check

  1. Your organization has a Marketing MG, a Finance MG, and an EMEA MG. What is the design smell, and what should the tree model instead?
  2. A DeployIfNotExists diagnostics policy reports every resource non-compliant and no logs are flowing. What is the single most likely cause, and how do you confirm it?
  3. True or false: assigning a Deny policy at a child MG can loosen a Deny inherited from a parent MG.
  4. Two new workloads can’t peer their VNets. What governance failure most likely caused this, and what does a landing zone do to prevent it?
  5. Name the three platform subscriptions and the one rule about what runs in them.

Answers

  1. The smell is modeling the org chart. Org charts re-organize, forcing constant subscription re-parenting and producing policy that doesn’t match real control needs. The tree should model governance requirements instead — e.g. a Platform branch and Landing-Zones branches split into Corp (private) and Online (internet-facing), because those classes need genuinely different guardrails.
  2. The assignment’s managed identity lacks an RBAC role on the target scope, so remediation silently no-ops. Confirm with az role assignment list --assignee <assignment-principalId> (find the principal via az policy assignment show --query identity.principalId) returning empty. Fix: grant the identity the role (e.g. Contributor) at the scope and run az policy remediation create.
  3. False. Policy is additive and Deny is sticky — a child scope can only make things stricter, never looser. The only way to relax an inherited policy for a specific resource is a scoped policy exemption, not a contrary assignment.
  4. Overlapping CIDRs because the spokes were given team-chosen, colliding address space (overlapping VNets can’t peer). A landing zone prevents this with a central IP plan from which subscription vending allocates every spoke a non-overlapping range.
  5. Identity (domain/Entra DS, PKI), Management (central Log Analytics, automation, Defender/Sentinel), and Connectivity (hub VNet, firewall, gateways, DNS). The rule: no business workload ever runs in them — they hold shared platform services only, to contain blast radius and keep cost attribution clean.
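
Answer 3 follows from how effective policy is computed: a scope is governed by the union of the assignments at every ancestor. The toy model below assumes a hypothetical contoso tree and made-up assignment names, and it deliberately ignores exemptions, which are the real relaxation mechanism.

```python
# child -> parent; the top of the tree has no parent.
PARENT = {
    "contoso": None,
    "landing-zones": "contoso",
    "corp": "landing-zones",
    "online": "landing-zones",
    "sub-payments": "corp",
}

# Deny assignments made directly at each scope (names are illustrative).
ASSIGNED = {
    "contoso": {"allowed-locations"},
    "corp": {"deny-public-ip"},
}

def effective_denies(scope: str) -> set:
    """Union of Deny assignments from the scope up to the top of the tree."""
    result = set()
    while scope is not None:
        result |= ASSIGNED.get(scope, set())
        scope = PARENT[scope]
    return result

print(sorted(effective_denies("sub-payments")))  # ['allowed-locations', 'deny-public-ip']
print(sorted(effective_denies("online")))        # ['allowed-locations']
```

Nothing assigned at a child can remove an entry from the set; a new assignment only grows it, which is why a contrary assignment cannot loosen an inherited Deny.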

Glossary

Next steps

You can now design the governance spine, stand up the platform and connectivity, and vend governed subscriptions. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
