Portal-clicked policy is governance you cannot review, diff, or roll back. A rule assigned by hand in a blade has no pull request, no reviewer, and no recorded why — and the day an auditor asks “who approved exempting this subscription from disk encryption, and when does the waiver expire?”, clicking through blades does not produce an answer. The fix is to treat policy the way you treat infrastructure: definitions, initiatives, and assignments live in Git, get tested in a pipeline, and deploy to management groups through a promotion ring. This guide builds that pipeline end to end with EPAC (Enterprise Policy as Code), and it handles the part the quickstarts skip — remediating thousands of existing resources without melting the ARM control plane.
You will learn the four-object model (definition, initiative, assignment, exemption) and why keeping them separate is the entire discipline; how to choose an effect without either blocking legitimate deploys or auditing forever; how to validate cheaply with lint → What-If → an Audit ring before you ever enforce Deny; and how to run remediation tasks in throttled, per-landing-zone batches so a 429 Too Many Requests storm never leaves you with a half-fixed fleet. Because this is a reference you will return to mid-rollout, the effects, the modes, the EPAC commands, the pipeline gates, the failure modes and the cost drivers are all laid out as scannable tables — read the prose once, then keep the tables open while the pipeline runs.
By the end you will stop governing by mouse. When a control needs to change you will open a PR, read the What-If, watch the Audit blast radius in a sandbox management group, promote the same definition outward by flipping one parameter, and prove Git and Azure are in sync with a clean EPAC plan that reports no changes. That last property — a no-op plan as the definition of “in sync” — is what separates a governed estate from a pile of orphaned assignments nobody can account for.
What problem this solves
Governance that lives only in the portal rots in three predictable ways. It is unreviewable — there is no diff showing that someone widened a deny to an audit, no approver on the change, no commit message explaining the threshold. It is undeployable — you cannot stamp the same baseline across 600 subscriptions by hand without drift creeping in, and you certainly cannot recreate it after a tenant rebuild. It is unaccountable — exemptions become permanent holes because a portal exemption with no expiry and no ticket reference is indistinguishable from “someone turned this off and forgot.”
What breaks without a pipeline: a platform team ships a guardrail straight to Deny in production, discovers it blocks 4,000 legitimate deployments, and rolls it back in a panic — teaching everyone that governance is the enemy. Or a DeployIfNotExists policy with a wrong existenceCondition redeploys its template on every 24-hour scan, quietly burning ARM quota and money for months. Or a remediation task fans out tenant-wide at default concurrency, hits 429, and leaves half the fleet fixed and half not — so the next compliance scan re-flags everything and on-call cannot tell what actually changed.
Who hits this: any platform or cloud-governance team operating at landing-zone scale — the Azure Cloud Adoption Framework landing zones crowd, anyone running an enterprise-scale management-group hierarchy, and every shop that has graduated past “click a built-in initiative and hope.” It pairs with Azure Policy governance at scale (the conceptual ground this pipeline automates) and the Azure DevOps YAML multistage approvals patterns that gate it. The reward is governance you can review, diff, roll back, and prove — the same properties you already demand of your infrastructure code.
To frame the whole field before the deep dive, here is every failure class this pipeline can hit, the question it forces, and the one place to look first:
| Failure class | What you observe | First question to ask | First place to look | Most common single cause |
|---|---|---|---|---|
| Drift / orphaned assignment | EPAC plan wants to delete something live | Did someone change it in the portal? | `Build-DeploymentPlans` plan output | A hand-edited assignment not in Git |
| Effect too aggressive | New deploys suddenly blocked | Did we flip to Deny before measuring? | `az policy state summarize` audit count | Skipped the Audit ring |
| DINE/Modify won’t deploy | “required role assignments” error | Does the identity exist and have roles? | Assignment identity + role list | MSI replication lag or missing role |
| Remediation 429s | Half the fleet fixed, half re-flagged | Was the task throttled and scoped? | Remediation task failure column | Tenant-wide blast, default concurrency |
| Exemption sprawl | Compliance “clean” but holes everywhere | Are exemptions time-bound and in Git? | `az policy exemption list` expiry column | Portal exemption with no `expiresOn` |
| Compliance shows NotStarted | Dashboard empty after deploy | Has a scan run yet? | `az policy state summarize` | No on-demand scan triggered |
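As a rough illustration of the "first place to look" for exemption sprawl, the sketch below lists exemptions at a scope that carry no expiry. It assumes the Azure CLI is installed and signed in, and that `expiresOn` appears as a top-level field in the command's JSON output; the function name and the scope argument are made up for this example.

```shell
# Hypothetical helper: list policy exemptions at a scope that have no expiry date.
# Assumes `az` is installed and signed in; $1 is a scope such as a subscription
# or management group resource ID.
list_unexpiring_exemptions() {
  az policy exemption list --scope "$1" \
    --query '[?expiresOn==`null`].{name:name, category:exemptionCategory, assignment:policyAssignmentId}' \
    --output table
}
```

Calling `list_unexpiring_exemptions "/subscriptions/<subscription-id>"` should print one row per exemption that never expires; an empty table is the result you want.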
Learning objectives
By the end of this article you can:
- Separate the four policy object types — definition, initiative, assignment, exemption — and explain which describes capability, which describes enforcement, and which describes the documented exception.
- Author a custom policy definition correctly: distinguish `field` from `value`, enumerate aliases before writing a rule, and pick `Indexed` vs `All` mode deliberately.
- Choose the right effect for a control (`Audit`, `Deny`, `Modify`, `Append`, `DeployIfNotExists`, `AuditIfNotExists`, `Disabled`) and name what each can and cannot fix.
- Structure a policy repo by logical identity (never by scope) and wrap every definition in an initiative for a stable assignment surface and one compliance roll-up.
- Drive the EPAC workflow (`Build-DeploymentPlans` → `Deploy-PolicyPlan` → `Deploy-RolesPlan`) across a plan/deploy pipeline with an approval gate, and configure `global-settings.jsonc` with a `pacOwnerId` so drift removal is safe.
- Validate before rollout with lint → What-If → an `Audit` ring, read the audit count as your blast radius, and promote the same definition through `sandbox → nonprod → prod` by flipping one parameter.
- Remediate existing fleets with DINE/Modify + remediation tasks, control concurrency with `--parallel-deployments` and `--resource-count`, and keep `429` storms from leaving a half-fixed estate.
- Write time-bound, ticket-referenced exemptions for break-glass and waivers, and prove Git/Azure parity with a no-op EPAC plan.
Prerequisites & where this fits
You should already understand the building blocks of Azure governance: a management group (MG) is a container above subscriptions that policy and RBAC inherit down through; an Azure Policy assignment binds a rule to a scope (MG, subscription, or resource group) with parameter values; and RBAC (role assignments) is how the policy engine is granted permission to act on your behalf for Modify/DINE. You should be comfortable running az in Cloud Shell, reading JSON output, writing basic Bicep, and reading a YAML pipeline. PowerShell familiarity helps because EPAC is a PowerShell module.
This sits in the Governance & Platform Automation track. It assumes the conceptual ground from Azure Policy governance at scale and the hierarchy design in enterprise-scale management-group hierarchy design. It depends on the identity model in Entra RBAC governance, because the deploying principal’s permissions are the whole security boundary. It is one rung above infrastructure as code 101 with Terraform on Azure in mindset, and it pairs with Bicep deployment stacks, What-If & CI for the validation mechanics and Azure DevOps YAML multistage approvals for the gates.
A quick map of who owns what during a policy change, so you route the right approval fast:
| Layer | What lives here | Who usually owns it | What it can block / cause |
|---|---|---|---|
| Git repo (definitions, initiatives) | The capability — the rules themselves | Platform / governance team | Bad rule logic; broken alias reference |
| Assignment manifests | The enforcement — scope + effect param | Platform team + control owner | Wrong scope; effect too aggressive |
| `global-settings.jsonc` | PaC env → MG/sub mapping + `pacOwnerId` | Platform lead | Drift removal scope; safe-delete boundary |
| Pipeline (plan/deploy stages) | The promotion ring + approval gate | DevOps / platform | Who can merge to prod; gate bypass |
| Deploying service principal | The permission to write policy + roles | Identity team | PrincipalNotFound; missing UAA role |
| Exemptions tree | The documented exceptions | Control owner + approver | Sprawl; un-expiring holes |
Core concepts
Five mental models make every later decision obvious.
The four object types describe four different things, and mixing them is how repos rot. A definition is a single rule — an if/then — authored once and assignable anywhere. An initiative (policy set) bundles definitions and hoists their parameters so an assignment sets values once. An assignment binds a definition or initiative to a scope with concrete parameter values; this is the only object that knows where enforcement happens. An exemption is a time-bound, audited waiver of an assignment, down to the resource. Definitions and initiatives describe capability; assignments describe where it is enforced; exemptions describe the documented exceptions. Baking a subscription ID into a definition collapses two of these into one and you can never reuse or promote that rule again.
Policy intercepts the request, then re-checks on a scan. Most effects evaluate at resource create/update — the request is intercepted before it commits, which is exactly why Deny can block it and Modify/Append can mutate the payload in flight. Separately, a roughly 24-hour background compliance scan re-evaluates existing resources. The two *IfNotExists effects (DINE, AINE) only ever fire on writes and on that scan — never inline — because they must inspect related resources that already exist. Knowing which trigger an effect uses tells you whether it can fix the fleet or only stop new violations.
Aliases are the contract, and field ≠ value. field reads a property of the resource being evaluated and is alias-aware — Microsoft.Storage/storageAccounts/allowBlobPublicAccess is an alias mapping to the resource’s real property path. value evaluates an arbitrary expression (a parameter, [resourceGroup()], a template function) that has nothing to do with the target resource. Use field for “what is this resource’s property”; use value for “what does this expression compute to.” Crucially, a field condition lets deny/modify reach into the request payload before commit; a value condition cannot. If there is no alias for a property, you cannot write policy against it — so you enumerate aliases first.
EPAC reconciles desired state and that is the whole point. You can hand-roll deployment with az policy commands, but reconciling the repo against what is live — including deleting assignments you removed from Git — is what EPAC automates. It reads your repo, builds a plan, and applies it idempotently with full drift detection: anything in Azure stamped with your pacOwnerId that is not in Git gets flagged and (optionally) removed. The pacOwnerId is the safety boundary — EPAC never touches objects it did not stamp, so it cannot delete another team’s assignments or Microsoft’s built-in initiatives.
The effect is the most consequential single decision. Pick wrong and you either block legitimate deployments (Deny too early) or audit forever while nothing improves (Audit with no promotion plan). The parameterized-effect pattern — ship the same definition as Audit in sandbox, Audit/Deny in nonprod, Deny/DINE in prod — is the entire value of keeping the effect a parameter on the assignment rather than baked into the definition.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters to the pipeline |
|---|---|---|---|
| Definition | A single `if`/`then` rule | Git → MG scope | The reusable capability; never scope-bound |
| Initiative | A bundle of definitions + params | Git → MG scope | Stable assignment surface; one roll-up |
| Assignment | Definition/initiative bound to a scope | Git → applied at scope | The only object that says where + effect |
| Exemption | Time-bound waiver of an assignment | Git → down to resource | The documented exception, not a hole |
| Effect | What the rule does when matched | On the definition, param on assignment | Audit/Deny/Modify/DINE… |
| Alias | Map from policy field → resource property | Resource provider | No alias → no policy on that property |
| Mode | Which resource types are evaluated | On the definition | Indexed vs All |
| EPAC | PowerShell module that reconciles repo↔Azure | Pipeline agent | Plan/deploy + drift detection |
| `pacOwnerId` | The stamp EPAC manages by | `global-settings.jsonc` | Safe-delete boundary for drift |
| Remediation task | Bulk fix of existing non-compliant resources | ARM control plane | One deployment per resource; throttled |
| Compliance scan | ~24h re-evaluation of existing resources | Platform-managed | When DINE/AINE/audit data refreshes |
| What-If | Preview of what a deployment would change | ARM API / pipeline | Cheapest pre-merge validation |
The effects reference
The effect is the single most consequential field in a policy. This is the lookup table you scan first — every effect, when it fires, what it can fix, and the non-obvious requirement that bites. The traps are that Deny cannot fix existing resources, that Modify and DINE both need a managed identity with concrete roles, and that DINE’s existenceCondition is what makes it idempotent or a money pit.
| Effect | Fires when | Can fix existing? | Needs identity? | Use it for | Key gotcha |
|---|---|---|---|---|---|
| `Audit` | On write + on scan; only flags | No (reports only) | No | Measuring a new rule’s blast radius | The audit count is your blast radius |
| `Deny` | On write, before commit | No; stops new only | No | Hard guardrails (“no public IPs in Corp”) | Existing drift untouched; pair with Modify/DINE |
| `Modify` | On write; mutates payload | With remediation task | Yes (MSI + roles) | Add/replace tags, enforce properties | Needs location on the assignment |
| `Append` | On write; adds fields | No (write-time only) | No | Inject a default where none supplied | Cannot change an existing value, only add |
| `DeployIfNotExists` (DINE) | On write + on scan, if related resource missing | With remediation task | Yes (MSI + roles) | Auto-onboard diagnostics, Defender, backup | Wrong `existenceCondition` → redeploys every scan |
| `AuditIfNotExists` (AINE) | On write + on scan | No (reports only) | No | Report on missing companion resources | Same `existenceCondition` care, no money risk |
| `Manual` | Sets compliance you attest manually | N/A (attestation) | No | Controls Azure can’t technically check | Compliance is set by a human, not the engine |
| `Disabled` | Never | No | No | Kill one rule inside an initiative | No audit trail; prefer an exemption |
| `DenyAction` | On a delete (or specified operation) | N/A | No | Block deletion of protected resources | Newer; scope which operations carefully |
Two reading notes that save the most time:
| Distinction | The trap | How to tell them apart |
|---|---|---|
| `Deny` vs `Modify` for the same property | Teams `Deny` a missing tag, then can’t deploy anything | `Deny` blocks the deploy; `Modify` adds the tag for you, which is usually what you want for tags |
| `Disabled` vs exemption | Disabling kills the rule for everyone, silently | `Disabled` = no scope, no expiry, no trail; an exemption is scoped, expiring, audited |
And the per-effect requirements, because Modify and the two IfNotExists effects need extra fields in the definition:
| Effect | Requires `roleDefinitionIds` in definition? | Requires `details.operations` (Modify)? | Requires `existenceCondition` (DINE/AINE)? | Requires deployment template (DINE)? |
|---|---|---|---|---|
| `Audit` / `Deny` / `Append` | No | No | No | No |
| `Modify` | Yes | Yes (`add`/`addOrReplace`/`remove` ops) | No | No |
| `AuditIfNotExists` | No | No | Yes | No |
| `DeployIfNotExists` | Yes | No | Yes | Yes (ARM template) |
| `Manual` / `Disabled` | No | No | No | No |
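Because Modify and DeployIfNotExists only act through the assignment's managed identity, a quick check of that identity and its roles is often the fastest way to explain a "required role assignments" error. The following is a minimal sketch, assuming the Azure CLI is signed in with read access; the function name is hypothetical and the assignment name and scope are whatever you pass in.

```shell
# Hypothetical helper: show the managed identity of a policy assignment and the roles it holds.
# $1 = assignment name, $2 = scope the assignment lives at.
check_assignment_identity() {
  assignment_name="$1"
  scope="$2"
  principal_id=$(az policy assignment show --name "$assignment_name" --scope "$scope" \
    --query identity.principalId --output tsv)
  # An empty (or "None") result means the assignment has no identity to act with.
  if [ -z "$principal_id" ] || [ "$principal_id" = "None" ]; then
    echo "no managed identity on assignment $assignment_name" >&2
    return 1
  fi
  az role assignment list --assignee "$principal_id" --all \
    --query '[].{role:roleDefinitionName, scope:scope}' --output table
}
```

If the role list comes back empty shortly after a deploy, replication lag is a plausible cause; re-running the check a few minutes later distinguishes lag from a role that was never granted.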
Anatomy of a custom policy definition
A definition is JSON with a policyRule (the logic) and parameters (the knobs). The rule’s if block evaluates resource properties; the matched resources get the then.effect.
{
  "properties": {
    "displayName": "Storage accounts must disable public blob access",
    "mode": "Indexed",
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Deny",
        "allowedValues": ["Audit", "Deny", "Disabled"]
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          {
            "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess",
            "notEquals": false
          }
        ]
      },
      "then": { "effect": "[parameters('effect')]" }
    }
  }
}
Before you write the rule, enumerate the aliases — if there is no alias for a property, you cannot target it:
# List aliases for a resource type and confirm they're modifiable (needed for modify/append)
az provider show --namespace Microsoft.Storage \
--expand "resourceTypes/aliases" \
--query "resourceTypes[?resourceType=='storageAccounts'].aliases[].{alias:name, modifiable:defaultMetadata.attributes}" \
-o table
The if block: conditions, operators, and logical structure
The if block is a tree of conditions joined by allOf/anyOf/not. Each leaf compares a field or value against an operator. Knowing the full operator set — and which ones are case-sensitive or accept wildcards — is what lets you write a precise rule instead of an over-broad one that flags half the estate.
| Operator | Compares | Wildcards? | Typical use | Gotcha |
|---|---|---|---|---|
| `equals` / `notEquals` | Exact scalar | No | Resource type, a boolean property | Case-insensitive on strings |
| `like` / `notLike` | String with `*` | Yes (`*`) | `name like "prod-*"` | Single `*`; not regex |
| `match` / `notMatch` | `#`=digit `?`=letter `.`=any | Yes (glyph) | Naming patterns by char class | Case-sensitive |
| `matchInsensitively` / `notMatchInsensitively` | Same as `match`, case-insensitive | Yes | Naming patterns, any case | Slightly slower to reason about |
| `contains` / `notContains` | Substring | No | Tag value contains a token | Substring, not membership |
| `in` / `notIn` | Membership in an array | No | Allowed locations/SKUs list | Array must be a param or literal |
| `containsKey` / `notContainsKey` | Object has a key | No | `tags containsKey "cost-center"` | Key presence, not value |
| `greater` / `less` / `greaterOrEquals` / `lessOrEquals` | Numeric / date | No | Retention days, minTLS version | Type must compare cleanly |
| `exists` | `"true"`/`"false"` | No | Property present at all | String boolean, not bare bool |
The logical operators and how they nest:
| Logical op | Semantics | When to reach for it | Pitfall |
|---|---|---|---|
| `allOf` | AND: every child must match | The default; scope a rule to a type + condition | Without it, `if` can hold only a single condition |
| `anyOf` | OR: at least one child matches | “TLS < 1.2 OR public access on” | Easy to make too broad |
| `not` | Negate the wrapped condition | “NOT in the allowed-SKU list” | Double negatives get unreadable fast |
| `count` | Count array elements matching a condition | “≥1 NSG rule allows 0.0.0.0/0” | The most powerful and the easiest to misread |
| `field` (in `count`) | Iterate an array alias `[*]` | Inspect each subnet/rule/IP config | Needs an `[*]` alias to exist |
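To make the `count` row concrete, here is a sketch of the "≥1 NSG rule allows 0.0.0.0/0" case from the table. It assumes the `securityRules[*]` aliases shown exist for your API version (enumerate them first), and the file name, definition name, and helper function names are invented for illustration. The rule is written to a file so the single quotes inside the ARM expression are not mangled by the shell.

```shell
# Hypothetical helper: write a rule that counts inbound "allow from anywhere" NSG rules.
# $1 = path of the rules file to create.
write_open_inbound_rule() {
  cat > "$1" <<'EOF'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Network/networkSecurityGroups" },
      {
        "count": {
          "field": "Microsoft.Network/networkSecurityGroups/securityRules[*]",
          "where": {
            "allOf": [
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].direction", "equals": "Inbound" },
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].access", "equals": "Allow" },
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].sourceAddressPrefix", "in": ["*", "0.0.0.0/0", "Internet"] }
            ]
          }
        },
        "greaterOrEquals": 1
      }
    ]
  },
  "then": { "effect": "[parameters('effect')]" }
}
EOF
}

# Hypothetical helper: register the rule as a custom definition at a management group.
# $1 = rules file, $2 = management group name.
create_open_inbound_definition() {
  az policy definition create --name "audit-nsg-open-inbound" \
    --mode Indexed \
    --rules "$1" \
    --params '{ "effect": { "type": "String", "defaultValue": "Audit", "allowedValues": ["Audit", "Deny", "Disabled"] } }' \
    --management-group "$2"
}
```

The `where` block is evaluated once per array element, so the three conditions must all hold on the same security rule; that is the part most often misread.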
field vs value, and mode
field reads a property of the target resource and is alias-aware; value evaluates an arbitrary expression. Use field to inspect the resource, value to compute something independent of it. The distinction also governs power: a field condition can drive deny/modify into the request payload, a value condition cannot. The mode then decides which resource types are even evaluated.
| Mode | Evaluates | Use it for | Skips | Gotcha |
|---|---|---|---|---|
| `Indexed` | Resource types that support tags + location | The vast majority of resource policies | RGs, subscriptions, types without tags/location | Default-correct; avoids false non-compliance on types that cannot carry tags or a location |
| `All` | Every resource, plus RGs and subscriptions | Policies that must evaluate RGs/subs themselves | Nothing | Use only when you genuinely target containers |
| `Microsoft.Kubernetes.Data` | AKS in-cluster objects (via Gatekeeper) | Pod-level constraints on AKS | Non-AKS | Pairs with the Gatekeeper/OPA admission model |
| `Microsoft.KeyVault.Data` | Objects inside Key Vault (certs, keys) | Cert/key policy within a vault | Non-KV-data | Data-plane mode, different alias set |
| `Microsoft.Network.Data` | Specific network data-plane objects | Niche network controls | Others | Rarely needed |
# field vs value in practice: 'field' reads the resource; 'value' computes from a function.
# This audits resources whose location is NOT in the allowedLocations parameter,
# unless the containing resource group's own location is "global".
# The rule goes in a file so the single quotes that ARM expressions require survive the shell.
cat > loc-must-match-rg.rules.json <<'EOF'
{
  "if": { "allOf": [
    { "field": "location", "notIn": "[parameters('allowedLocations')]" },
    { "value": "[resourceGroup().location]", "notEquals": "global" }
  ]},
  "then": { "effect": "audit" }
}
EOF
az policy definition create --name "loc-must-match-rg" \
  --rules loc-must-match-rg.rules.json \
  --params '{ "allowedLocations": { "type": "Array" } }' \
  --mode Indexed
Evaluation order, restated as a rule: policy runs on resource create/update (the request is intercepted before commit, which is why `deny` blocks and `modify`/`append` mutate), and again on the ~24-hour background compliance scan. `auditIfNotExists`/`deployIfNotExists` only fire on that scan and on writes, never inline, because they inspect related resources that already exist. If your dashboard shows `NotStarted`, no scan has run yet; trigger one rather than assuming the policy is broken.
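Triggering a scan and then reading the non-compliant count can be sketched as two small helpers. This assumes the Azure CLI is signed in and that the summary output exposes `results.nonCompliantResources`; the function names and arguments are placeholders for this example.

```shell
# Hypothetical helper: start an on-demand compliance scan for one subscription.
# The command waits for the scan to finish unless you add --no-wait.
trigger_scan() {
  az policy state trigger-scan --subscription "$1"
}

# Hypothetical helper: print how many resources are non-compliant for one assignment.
# $1 = subscription ID or name, $2 = policy assignment name.
noncompliant_count() {
  az policy state summarize --subscription "$1" --policy-assignment "$2" \
    --query 'results.nonCompliantResources' --output tsv
}
```

Run `trigger_scan` first, then `noncompliant_count`; with an `Audit` effect, the number printed is the blast radius the guide keeps referring to.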
Choosing and parameterizing the effect
The effect decides whether a control measures, blocks, or fixes. The reference table above enumerates all of them; the discipline is to make the effect a parameter on the assignment, not a constant in the definition, so the same rule can ship Audit then Deny per ring without a code change.
The non-obvious rules, restated for the decision you actually face:
| You want to… | Wrong effect (and why) | Right effect | Extra requirement |
|---|---|---|---|
| Stop new public storage | `Modify` (can’t remove a missing property cleanly) | `Deny` | None |
| Tag every new resource with cost-center | `Deny` (blocks the deploy) | `Modify` | MSI + Tag Contributor; location set |
| Onboard diagnostics to Log Analytics | `Deny`/`Append` (can’t create a child) | `DeployIfNotExists` | MSI + roles; correct `existenceCondition` |
| Report which VMs lack backup | `Deny` (forward-only) | `AuditIfNotExists` | Correct `existenceCondition` |
| Measure a brand-new control’s reach | `Deny` (breaks teams immediately) | `Audit` | None; read the count first |
| Fix existing untagged resources | `Deny` (never touches them) | `Modify` + remediation task | MSI + roles + throttled remediation |
A worked parameterized definition — one rule, three ring behaviours from a single effect parameter:
{
  "properties": {
    "displayName": "Resources must carry a cost-center tag",
    "mode": "Indexed",
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Audit",
        "allowedValues": ["Audit", "Modify", "Disabled"]
      },
      "tagName": { "type": "String", "defaultValue": "cost-center" }
    },
    "policyRule": {
      "if": { "field": "[concat('tags[', parameters('tagName'), ']')]", "exists": "false" },
      "then": {
        "effect": "[parameters('effect')]",
        "details": {
          "roleDefinitionIds": [
            "/providers/Microsoft.Authorization/roleDefinitions/4a9ae827-6dc8-4573-8ac7-8239d42aa03f"
          ],
          "operations": [
            { "operation": "add", "field": "[concat('tags[', parameters('tagName'), ']')]", "value": "unassigned" }
          ]
        }
      }
    }
  }
}
Sandbox assigns effect=Audit and reads the count; prod assigns effect=Modify and pairs it with a remediation task. Same Git, same definition.
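Outside EPAC, the same per-ring switch can be sketched with the plain CLI. This is an illustration only: it assigns the definition directly for brevity (the guide recommends wrapping it in an initiative), the assignment name, default region, and function name are assumptions, and it presumes the definition ID you pass in already exists.

```shell
# Hypothetical helper: assign the cost-center definition to a ring with a chosen effect.
# $1 = definition ID, $2 = management group name, $3 = effect, $4 = identity region (optional).
assign_cost_center_tag() {
  definition_id="$1"
  mg="$2"
  effect="$3"
  location="${4:-eastus}"
  scope="/providers/Microsoft.Management/managementGroups/$mg"
  params="{\"effect\":{\"value\":\"$effect\"}}"
  if [ "$effect" = "Modify" ]; then
    # Modify acts through a managed identity, so the assignment needs an identity,
    # a location for it, and the Tag Contributor role at the assignment scope.
    az policy assignment create --name cost-center-tag \
      --policy "$definition_id" --scope "$scope" --params "$params" \
      --mi-system-assigned --location "$location" \
      --role "Tag Contributor" --identity-scope "$scope"
  else
    az policy assignment create --name cost-center-tag \
      --policy "$definition_id" --scope "$scope" --params "$params"
  fi
}
```

For example, `assign_cost_center_tag "$DEF_ID" sandbox Audit` followed later by `assign_cost_center_tag "$DEF_ID" prod Modify` changes only the parameter value; the definition itself never changes.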
Structuring the repo
Keep the four object types in separate trees, named by their logical identity, never by scope. Scope-naming (prod-sub-storage.json) is how you end up unable to promote or reuse anything.
policy/
├── policy-definitions/
│ └── deny-storage-public-access.json
├── policy-set-definitions/ # initiatives
│ └── security-baseline.json
├── policy-assignments/
│ ├── platform-mg.json # one manifest per management group
│ └── landing-zones-mg.json
├── policy-exemptions/
│ └── prod/ # exemptions, segregated by environment
└── global-settings.jsonc # PaC env -> MG/subscription mapping
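A minimal sketch of scaffolding this tree, plus a cheap lint for the "never by scope" rule, might look like the following. The directory names are the ones shown above; the function names, the empty settings stub, and the choice to flag any hard-coded subscription ID are assumptions made for the example.

```shell
# Hypothetical helper: create the four object trees and a settings stub under a root folder.
scaffold_policy_repo() {
  root="${1:-policy}"
  mkdir -p "$root/policy-definitions" "$root/policy-set-definitions" \
           "$root/policy-assignments" "$root/policy-exemptions/prod"
  # Leave an existing settings file alone; otherwise drop in an empty placeholder.
  [ -f "$root/global-settings.jsonc" ] || printf '{}\n' > "$root/global-settings.jsonc"
}

# Hypothetical lint: fail if a definition or initiative hard-codes a subscription ID.
lint_no_scope_in_definitions() {
  root="${1:-policy}"
  if grep -rEn '/subscriptions/[0-9a-fA-F-]{36}' \
       "$root/policy-definitions" "$root/policy-set-definitions"; then
    echo "scope baked into a definition or initiative" >&2
    return 1
  fi
}
```

Running the lint as the first pipeline step keeps scope out of the capability trees, so assignments remain the only place a scope can appear.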
An initiative groups definitions and hoists their parameters so an assignment sets values once:
{
  "properties": {
    "displayName": "Security Baseline",
    "policyType": "Custom",
    "parameters": {
      "storageEffect": { "type": "String", "defaultValue": "Deny" }
    },
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "denyStoragePublic",
        "policyDefinitionId": "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyDefinitions/deny-storage-public-access",
        "parameters": { "effect": { "value": "[parameters('storageEffect')]" } }
      }
    ]
  }
}
Always assign initiatives, not loose definitions — even an initiative of one. It gives you a stable assignment surface (add rules later without re-pointing the assignment) and one compliance roll-up per business control. The repo-layout decisions and what each buys you:
| Directory / file | Holds | Named by | Why it matters |
|---|---|---|---|
| `policy-definitions/` | Single rules | The capability (`deny-storage-public-access`) | Reusable anywhere; promotion-safe |
| `policy-set-definitions/` | Initiatives | The business control (`security-baseline`) | One roll-up; stable assignment target |
| `policy-assignments/` | One manifest per MG | The scope it targets (`platform-mg`) | The only place scope appears |
| `policy-exemptions/<env>/` | Exemptions | Ticket + resource | Lifecycle-managed; expires on its own |
| `global-settings.jsonc` | PaC env → scope + `pacOwnerId` | The environment selector | Safe-delete boundary; per-ring mapping |
The object hierarchy as a quick reference for what nests in what:
| Object | Contains | Contained by | Carries parameters? | Carries scope? |
|---|---|---|---|---|
| Definition | `policyRule` + `parameters` | Initiative (by reference) or assigned directly | Declares them | No |
| Initiative | References to definitions | An assignment | Hoists definition params | No |
| Assignment | A definition or initiative ID + param values | Applied at a scope | Sets values | Yes |
| Exemption | A reference to an assignment (+ optional definition refs) | A scope, down to a resource | No | Yes + expiry |
Built-in vs custom definitions — when to author your own vs reference Microsoft’s, and how EPAC treats each:
| Aspect | Built-in definition | Custom definition |
|---|---|---|
| Authored by | Microsoft | You (in Git) |
| `policyType` | `BuiltIn` | `Custom` |
| Lives at | Tenant root (always available) | The MG scope you deploy it to |
| Versioned / updated by | Microsoft (can change under you) | You — diffable in PRs |
| EPAC manages it? | References only (never deletes) | Yes — owned via pacOwnerId |
| Use when | A standard control already exists (e.g. CIS, Defender) | No built-in matches your exact rule |
| Gotcha | A Microsoft update can shift behaviour | You own maintenance and alias drift |
| Composes in an initiative with custom? | Yes — mix freely under one roll-up | Yes |
The EPAC workflow
You can hand-roll deployment with az policy commands, but reconciling desired state against what is live — including deleting assignments you removed from Git — is exactly what EPAC (Enterprise Policy as Code) solves. It is a maintained PowerShell module that reads your repo, builds a plan, and applies it idempotently, with full drift detection: anything in Azure stamped with your owner ID that is not in Git is flagged and (optionally) removed.
EPAC’s three commands map cleanly onto a pipeline:
Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser
# 1. PLAN — diff desired (repo) vs. deployed; emit plan artifacts, change nothing
Build-DeploymentPlans `
-DefinitionsRootFolder ./policy `
-OutputFolder ./output `
-PacEnvironmentSelector epac-prod
# 2. DEPLOY definitions, initiatives, and assignments from the plan
Deploy-PolicyPlan `
-DefinitionsRootFolder ./policy `
-InputFolder ./output `
-PacEnvironmentSelector epac-prod
# 3. DEPLOY the role assignments DINE/modify identities need
Deploy-RolesPlan `
-DefinitionsRootFolder ./policy `
-InputFolder ./output `
-PacEnvironmentSelector epac-prod
The three commands, what each does, what it touches, and the artifact it produces or consumes:
| Command | Phase | Reads | Writes | Changes Azure? | Runs in pipeline stage |
|---|---|---|---|---|---|
| `Build-DeploymentPlans` | Plan | Repo + live Azure state | `policy-plan.json`, `roles-plan.json` | No | Plan (on PR) |
| `Deploy-PolicyPlan` | Deploy objects | Policy plan artifact | Definitions, initiatives, assignments | Yes | Deploy (on merge) |
| `Deploy-RolesPlan` | Deploy roles | Roles plan artifact | Role assignments for MSIs | Yes | Deploy-roles (after deploy) |
Why EPAC over hand-rolling — the three ways to ship policy as code, side by side:
| Capability | Raw `az policy` scripts | ARM/Bicep templates | EPAC |
|---|---|---|---|
| Create definitions/initiatives/assignments | Yes (imperative) | Yes (declarative) | Yes (declarative) |
| Delete what you removed from Git (drift) | No; you script deletes by hand | No; orphans linger | Yes; automatic, owner-scoped |
| Safe-delete boundary (`pacOwnerId`) | None | None | Yes |
| Plan/preview before apply | No native diff | What-If | Build-DeploymentPlans diff |
| Manages DINE/Modify role assignments | Manual | Manual | Deploy-RolesPlan |
| Multi-ring promotion by selector | Hand-rolled | Per-env templates | PacEnvironmentSelector |
| Idempotent re-run | Depends on your scripts | Mostly | Yes |
| Best for | One-off / tiny estates | Mid estates without drift needs | Landing-zone scale, audited |
The global-settings.jsonc ties a selector to a real scope and identity:
{
  "pacOwnerId": "f0000000-1111-2222-3333-444444444444",
  "pacEnvironments": [
    {
      "pacSelector": "epac-prod",
      "cloud": "AzureCloud",
      "tenantId": "<tenant-guid>",
      "deploymentRootScope": "/providers/Microsoft.Management/managementGroups/contoso"
    }
  ]
}
pacOwnerId is what makes drift detection safe: EPAC only manages objects it stamped with that owner ID, so it never deletes assignments created by another team or by Microsoft’s built-in policy initiatives. The settings that govern reconciliation behaviour:
| `global-settings.jsonc` key | Controls | Default / typical | When to change | Risk if wrong |
|---|---|---|---|---|
| `pacOwnerId` | Which objects EPAC manages/deletes | A unique GUID per repo | Never reuse across repos | Deletes another repo’s objects |
| `pacSelector` | The environment/ring name | `epac-dev`/`epac-prod` | Per ring | Deploys to the wrong MG |
| `deploymentRootScope` | The MG the plan targets | Root or intermediate MG | Per ring scope | Over-broad enforcement |
| `managedIdentityLocation` | Region for MSI-bearing assignments | e.g. `eastus` | Match your estate | Identity-bearing deploy fails |
| `globalNotScopes` | Scopes EPAC never manages | Decommissioned subs | Carve-outs | Manages a scope you meant to exclude |
| `desiredState.strategy` | `full` vs `ownedOnly` deletion | `ownedOnly` (safe) | Rarely → `full` | `full` can delete unowned objects |
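Before widening `desiredState.strategy`, it can help to see which assignments the owner stamp actually covers. The sketch below assumes EPAC records the owner as `metadata.pacOwnerId` on the objects it deploys (verify this against a deployed assignment in your tenant) and that the Azure CLI is signed in; the function name is hypothetical.

```shell
# Hypothetical helper: list assignments at a scope that carry a given pacOwnerId stamp.
# $1 = pacOwnerId GUID from global-settings.jsonc, $2 = scope to inspect.
# Assumption: the stamp is stored in the assignment's metadata as pacOwnerId.
list_epac_owned_assignments() {
  owner_id="$1"
  scope="$2"
  az policy assignment list --scope "$scope" \
    --query "[?metadata.pacOwnerId=='$owner_id'].{name:name, scope:scope}" \
    --output table
}
```

Anything live at that scope but missing from this list is outside the safe-delete boundary, which is exactly what `ownedOnly` leaves untouched.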
In Azure Pipelines, split plan from deploy across stages with an environment approval gate between them — plan on PR, deploy on merge:
stages:
  - stage: Plan
    jobs:
      - job: BuildPlan
        steps:
          # EPAC cmdlets use the Az PowerShell context, so sign in with AzurePowerShell
          - task: AzurePowerShell@5
            inputs:
              azureSubscription: epac-spn   # workload identity federation
              azurePowerShellVersion: LatestVersion
              pwsh: true
              ScriptType: InlineScript
              Inline: |
                Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser -Force
                Build-DeploymentPlans -DefinitionsRootFolder ./policy `
                  -OutputFolder $(Build.ArtifactStagingDirectory) `
                  -PacEnvironmentSelector epac-prod
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: policy-plan
  - stage: Deploy
    dependsOn: Plan
    jobs:
      - deployment: ApplyPolicy
        environment: policy-prod   # add an approval check here
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self   # deployment jobs do not check out the repo by default
                - download: current
                  artifact: policy-plan
                - task: AzurePowerShell@5
                  inputs:
                    azureSubscription: epac-spn
                    azurePowerShellVersion: LatestVersion
                    pwsh: true
                    ScriptType: InlineScript
                    Inline: |
                      Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser -Force
                      Deploy-PolicyPlan -DefinitionsRootFolder ./policy `
                        -InputFolder $(Pipeline.Workspace)/policy-plan `
                        -PacEnvironmentSelector epac-prod
The deploying identity needs Resource Policy Contributor at the root management group for policy objects, plus User Access Administrator (or Owner) to create the role assignments DINE identities require. Grant it to the federated service principal, not a human, and gate it behind PR review. The exact roles the pipeline principal needs and why:
| Role | Scope | Why the pipeline needs it | If missing |
|---|---|---|---|
| Resource Policy Contributor | Root MG (or per-ring MG) | Create/update definitions, initiatives, assignments | Policy objects fail to deploy |
| User Access Administrator | Root MG | Create role assignments for DINE/Modify MSIs | Deploy-RolesPlan fails; DINE can’t act |
| Reader (implied by above) | Root MG | Read live state for the plan diff | Plan can’t compute drift |
| Managed Identity Operator (sometimes) | MG/sub | If using user-assigned identities | UAMI-based DINE can’t bind |
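The two mandatory rows of the role table translate into two role assignments on the pipeline principal. The block below is a minimal sketch, not the article's own tooling: `print_role_grants` is a hypothetical helper, and the principal object ID and management-group name are assumed inputs. It only prints the `az role assignment create` commands so a reviewer can read them before anyone runs them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the role grants for the pipeline principal.
# Arguments: <principal-object-id> <management-group-name>
print_role_grants() {
  local principal_id="$1" mg="$2"
  local scope="/providers/Microsoft.Management/managementGroups/${mg}"
  local role
  # The two roles named in the table above; anything broader is out of scope here.
  for role in "Resource Policy Contributor" "User Access Administrator"; do
    printf 'az role assignment create --assignee-object-id %s --assignee-principal-type ServicePrincipal --role "%s" --scope %s\n' \
      "$principal_id" "$role" "$scope"
  done
}
```

Calling `print_role_grants <object-id> contoso` emits one command per role at the management-group scope. Printing rather than executing keeps the grant itself behind the same review step as everything else.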
The CI/CD platform choice does not change the model: the same plan/deploy split works in GitHub Actions with OIDC federation (the pattern used for GitHub Actions + Terraform OIDC plan/PR automation), or in Azure DevOps with multistage YAML approvals.
Testing before rollout
Three layers of validation, cheapest first. The discipline is to never let a policy reach Deny in production without passing all three.
| Layer | What it catches | Cost | Speed | Where it runs |
|---|---|---|---|---|
| 1. Lint + What-If | Bad JSON shape; what the merge would create/change | Free | Seconds | On PR |
| 2. Audit ring + scan | The real-world blast radius (how many resources non-compliant) | Free | Minutes (on-demand scan) | Sandbox MG |
| 3. MG promotion ring | Whether enforcement breaks real teams | Free | Per ring | sandbox → nonprod → prod |
1. Lint and What-If on PR. Validate JSON shape, then run a What-If of the policy deployment to confirm what objects the merge would create or change — without touching production:
# Structural sanity for every definition/initiative JSON
Get-ChildItem ./policy -Recurse -Include *.json |
ForEach-Object { $null = Get-Content $_ -Raw | ConvertFrom-Json }
# What-If the policy artifacts at the management-group scope
az deployment mg what-if \
--management-group-id contoso \
--location eastus \
--template-file ./policy-bicep/assignments.bicep
What-If change types and what each tells you about the merge:
| What-If change type | Means | Safe to merge? | Watch for |
|---|---|---|---|
| Create | A new policy object will be added | Usually | An unexpected duplicate of an existing rule |
| Modify | An existing object’s properties change | Review the diff | A scope or effect change you didn’t intend |
| Delete | An object will be removed | Pause | Drift removal of something still in use |
| NoChange | Already matches desired state | Yes | A clean plan should be mostly this |
| Ignore | Out of scope for this deployment | Yes | n/a |
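The "Pause" on Delete can be turned into a PR check. The following is a rough sketch under stated assumptions: it expects the What-If result as JSON on stdin (for example from `az deployment mg what-if ... --no-pretty-print`), it assumes resource-level changes carry a `changeType` key, and it counts them with `grep` instead of parsing JSON. `count_changes` and `gate_on_deletes` are hypothetical names.

```shell
#!/usr/bin/env bash
# Reads What-If JSON on stdin and counts resource-level changes of one type.
# Crude on purpose: it greps for the "changeType" key rather than parsing JSON.
count_changes() {
  local change_type="$1" n
  n="$(grep -o "\"changeType\"[[:space:]]*:[[:space:]]*\"${change_type}\"" | wc -l)"
  echo $(( n ))
}

# Returns non-zero when the merge would delete anything, so a PR check can block on it.
gate_on_deletes() {
  local deletes
  deletes="$(count_changes Delete)"
  if (( deletes > 0 )); then
    echo "What-If proposes ${deletes} deletion(s); pausing for review" >&2
    return 1
  fi
  echo "No deletions proposed"
}
```

In a pipeline the What-If output would be piped straight into `gate_on_deletes`. A real JSON parser is the safer choice if one is available on the build agent.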
2. Assign as Audit in a ring, read compliance. Every effect-parameterized policy ships to a non-prod management group as Audit first. Trigger an on-demand scan and read the result instead of waiting ~24 hours:
# Force an evaluation at a scope, then summarize compliance
az policy state trigger-scan --resource-group rg-sandbox
az policy state summarize \
--management-group mg-sandbox \
--query "policyAssignments[].{name:policyAssignmentId, nonCompliant:results.nonCompliantResources}" \
-o table
If Audit flags 4,000 resources, flipping straight to Deny would have broken those teams. The audit count is your blast radius. The compliance states you’ll read and what each demands:
| Compliance state | Meaning | Your next move |
|---|---|---|
| Compliant | Resource satisfies the rule | None |
| NonCompliant | Resource violates the rule | This is the blast radius: remediate or accept before Deny |
| NotStarted | No scan has evaluated it yet | Trigger a scan; don’t conclude it’s broken |
| Exempt | An exemption covers it | Verify the exemption is time-bound and ticketed |
| Conflicting | Two assignments disagree on effect | Resolve overlapping assignments |
| Unknown (Manual effect) | Awaiting human attestation | Attest via the compliance API |
What actually triggers a compliance re-evaluation, and how fast each is — so you know whether to wait or force a scan:
| Trigger | What causes it | Latency | Force it manually? |
|---|---|---|---|
| Resource create/update | Any write to a governed resource | Inline (immediate) | N/A — it’s the write itself |
| New/changed assignment | Assigning or editing a policy | Evaluation kicks off within ~30 min | az policy state trigger-scan |
| Background compliance scan | Platform-scheduled sweep | ~24 hours | az policy state trigger-scan |
| On-demand scan | You request it | Minutes (scope-dependent) | Yes — the one you use in CI |
| Remediation ReEvaluateCompliance | A remediation task with re-scan mode | Per task | Via --resource-discovery-mode |
3. Promote through a management-group ring. Use distinct EPAC environment selectors per ring and promote the same definitions outward:
mg-sandbox -> mg-nonprod -> mg-prod
(Audit) (Audit/Deny) (Deny/DINE)
Same Git, same definitions; only the assignment’s effect parameter and target scope change between selectors. That is the entire value of parameterizing the effect. The ring promotion matrix:
| Ring | EPAC selector | Effect param | Approval gate | What it proves |
|---|---|---|---|---|
| Sandbox | epac-sandbox | Audit | None (auto on merge) | The rule is syntactically live; measures reach |
| Nonprod | epac-nonprod | Audit → Deny | Team lead | Enforcement doesn’t break realistic workloads |
| Prod | epac-prod | Deny / DINE | Change board | The control holds at scale with real traffic |
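The promotion matrix can be expressed as a small decision function. This is an illustrative sketch only: `effect_for_ring` is a hypothetical helper, and the rule it encodes (hold a ring at Audit until the previous ring's non-compliant backlog is zero) is one assumed policy, stricter than the "remediate or accept" wording above.

```shell
#!/usr/bin/env bash
# Hypothetical promotion rule: sandbox always audits; later rings enforce only
# once the previous ring's Audit backlog has been remediated or exempted to zero.
# Arguments: <ring> [non-compliant count from the previous ring]
effect_for_ring() {
  local ring="$1" backlog="${2:-0}"
  case "$ring" in
    sandbox) echo "Audit" ;;
    nonprod|prod)
      if (( backlog == 0 )); then echo "Deny"; else echo "Audit"; fi ;;
    *) echo "unknown ring: $ring" >&2; return 2 ;;
  esac
}
```

The value it prints is what would be fed into the assignment's effect parameter for that selector; the definition itself stays untouched.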
Remediation at scale
Deny is forward-looking. For existing fleets you need DINE or Modify plus remediation tasks, and at scale the ARM control plane is the bottleneck.
A DINE assignment must declare its identity and the roles it grants. In Bicep:
resource diagAssignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'deploy-diag-to-law'
  scope: managementGroup()
  location: 'eastus' // required when identity is set
  identity: { type: 'SystemAssigned' }
  properties: {
    policyDefinitionId: tenantResourceId(
      'Microsoft.Authorization/policySetDefinitions', 'diagnostics-baseline')
    parameters: {
      logAnalytics: { value: lawResourceId }
    }
  }
}
After the identity exists, grant it the roles its template needs (for diagnostics-to-Log-Analytics that is typically Monitoring Contributor + Log Analytics Contributor), then create the remediation task:
# Remediate one initiative member across the assignment's scope
az policy remediation create \
--name remediate-diag-2026q2 \
--management-group contoso \
--policy-assignment deploy-diag-to-law \
--definition-reference-id deployDiagnostics \
--resource-discovery-mode ReEvaluateCompliance
Throttling is the real engineering problem. A remediation task fans out one template deployment per non-compliant resource. Across thousands of resources that hammers ARM, and you will hit 429 Too Many Requests. Control concurrency with --parallel-deployments (how many remediations run at once) and --resource-count (the cap per task), then run multiple smaller, scoped tasks rather than one tenant-wide blast.
az policy remediation create \
--name remediate-diag-batch-01 \
--management-group contoso \
--policy-assignment deploy-diag-to-law \
--definition-reference-id deployDiagnostics \
--parallel-deployments 10 \
--resource-count 500
The remediation knobs, their defaults, and how to reason about each:
| Setting | What it controls | Default | Range / values | When to change |
|---|---|---|---|---|
| --parallel-deployments | Concurrent template deployments | 10 | 1–30 | Lower it the moment you see 429s |
| --resource-count | Max resources fixed per task | 500 | 1–50000 | Cap per landing zone to bound blast |
| --resource-discovery-mode | Whether to re-scan before fixing | ExistingNonCompliant | ExistingNonCompliant / ReEvaluateCompliance | ReEvaluate after a definition change |
| --location-filters | Restrict to regions | none | region list | Stage region-by-region |
| Scope (--management-group/--resource-group) | The set of resources targeted | The assignment scope | MG / sub / RG | Narrow to one landing zone per task |
The throttling reality as a sizing table — why one tenant-wide task fails and batches succeed:
| Approach | Deployments issued | ARM pressure | Failure mode | Outcome |
|---|---|---|---|---|
| One tenant-wide task, default concurrency | Thousands at once | Spikes past ARM write limits | 429 mid-run | Half-fixed fleet, re-flagged next scan |
| Per-landing-zone, --resource-count 500 | ≤500 per task | Bounded | Rare; isolated to one LZ | Clean batch; widen next |
| Per-LZ + --parallel-deployments 5 after a 429 | ≤500, throttled | Low | Almost none | Slow and steady; fully remediated |
Roll remediation out per landing zone, watch the failure column, and only widen concurrency once a batch lands clean. A remediation task that 429s halfway leaves a half-fixed fleet that the next compliance scan will re-flag — slow and steady wins. The remediation lifecycle states you’ll watch:
| Remediation state | Meaning | Action |
|---|---|---|
| Evaluating | Discovering non-compliant resources | Wait |
| InProgress | Issuing deployments | Watch the failure count |
| Succeeded | All targeted resources remediated | Widen scope / next LZ |
| Failed | One or more deployments failed (often 429) | Lower concurrency; re-run (idempotent) |
| Cancelled | Manually stopped | Re-create scoped tighter |
| Complete (with failures) | Finished but some resources unfixed | Inspect failures; re-run the remainder |
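The batching arithmetic behind "run small, widen only on clean batches" is easy to sketch. The helpers below are hypothetical and make two assumptions: a backlog inside one landing zone is split purely by count, and concurrency is halved after any throttled batch. They compute a plan only and do not call Azure.

```shell
#!/usr/bin/env bash
# Splits a non-compliant backlog into tasks of at most <cap> resources each.
# Prints one "<task-name> <resource-count>" line per task.
plan_batches() {
  local total="$1" cap="${2:-500}" i=1 size
  while (( total > 0 )); do
    size=$(( total < cap ? total : cap ))
    printf 'remediate-batch-%02d %d\n' "$i" "$size"
    total=$(( total - size ))
    i=$(( i + 1 ))
  done
}

# Halves concurrency after a throttled batch (never below 1); otherwise keeps it.
# Arguments: <current parallelism> <yes|no: did the last batch see a 429?>
next_parallelism() {
  local current="$1" saw_429="$2"
  if [[ "$saw_429" == "yes" ]]; then
    echo $(( current / 2 > 0 ? current / 2 : 1 ))
  else
    echo "$current"
  fi
}
```

Each printed line maps to one `az policy remediation create` call with `--resource-count` set to the batch size, and the value from `next_parallelism` goes into `--parallel-deployments` for the following batch. Halving is just one conservative step-down rule; any rule that only lowers concurrency after a 429 fits the same loop.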
Exemptions and break-glass
An exemption is the documented exception — and unlike disabling a policy, it is scoped, audited, and can expire on its own. Make every exemption time-bound:
az policy exemption create \
--name "waiver-legacy-sa-encryption" \
--policy-assignment "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyAssignments/security-baseline" \
--exemption-category Waiver \
--scope "/subscriptions/<legacy-sub-id>" \
--policy-definition-reference-ids denyStoragePublic \
--expires-on "2026-09-30T23:59:59Z" \
--description "INC-4821: legacy app migrating to managed identity; owner: platform-team"
Two categories exist: Waiver (you accept the risk and are not fixing it now) and Mitigated (the risk is handled by a compensating control outside policy). Putting the ticket number and owner in --description is what turns an exemption from a hole into an auditable decision. Commit exemption JSON to policy/policy-exemptions/<env>/ so EPAC manages their lifecycle and removes them from Azure the instant they leave Git.
The two categories and when each is honest:
| Category | Means | Use when | Audit expectation |
|---|---|---|---|
| Waiver | Risk accepted, not remediating now | A dated migration is underway | A ticket + a real expiry + an owner |
| Mitigated | Risk handled by a control outside policy | A compensating control covers it | A pointer to the compensating control |
The exemption fields that make it auditable vs a silent hole:
| Field | Purpose | Make it… | Smell if… |
|---|---|---|---|
| --expires-on | Auto-revoke date | Always set | Omitted → permanent hole |
| --description | The why + ticket + owner | INC-####: reason; owner: | Empty or “temp” |
| --exemption-category | Waiver vs Mitigated | Honest about the situation | Always Waiver with no plan |
| --policy-definition-reference-ids | Narrow to specific rules in an initiative | Scope to the one rule | Exempting the whole initiative |
| --scope | The narrowest scope that works | Resource, not subscription | Subscription-wide for one resource |
| Git location | Lifecycle management | In policy-exemptions/<env>/ | Created in the portal, untracked |
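The "smell" column lends itself to a pre-merge lint. What follows is a sketch, not an EPAC feature: `lint_exemption` is a hypothetical helper, and the ticket pattern (uppercase prefix, dash, digits, colon) is an assumption modelled on the INC-4821 example. It takes the expiry and description as plain arguments, so extracting them from the exemption file is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical lint for one exemption: needs an expiry in the future and a
# description that starts with a ticket id such as "INC-4821:".
# Arguments: <expires-on, ISO 8601 UTC> <description> [now, ISO 8601 UTC]
lint_exemption() {
  local expires="$1" description="$2"
  local now="${3:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  if [[ -z "$expires" ]]; then
    echo "missing expiry: permanent hole" >&2; return 1
  fi
  # Same-format UTC timestamps sort lexicographically, so a string compare is enough.
  if [[ ! "$expires" > "$now" ]]; then
    echo "expiry ${expires} is not in the future" >&2; return 1
  fi
  if [[ ! "$description" =~ ^[A-Z]+-[0-9]+: ]]; then
    echo "description lacks a ticket reference" >&2; return 1
  fi
  echo "ok"
}
```

The optional third argument pins "now" so the check is reproducible in CI logs. A stricter variant could also cap how far in the future an expiry may sit.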
For break-glass, never delete a policy assignment to unblock an incident — that silently removes the guardrail for everyone and leaves no trail. Instead, create a tightly scoped, short-expiresOn exemption through the emergency-change path, and let it self-revoke. The break-glass decision table:
| Incident pressure | Wrong move | Right move | Why |
|---|---|---|---|
| “This deploy is blocked, prod is down” | Delete the assignment | Scoped exemption, expiry hours away | Keeps the guardrail for everyone else; leaves a trail |
| “Disable the whole initiative” | Set effect Disabled | Exempt the one resource + rule | Disabled removes the control silently |
| “Just give the team Owner” | Broaden RBAC | Time-bound exemption | Exemption is reversible and audited |
| “We’ll clean it up later” | Permanent exemption, no expiry | expiresOn + ticket | Sprawl is the failure mode |
Architecture at a glance
The diagram traces the policy-as-code path the way it actually runs, left to right, and marks the five places it most often breaks. Read it as a pipeline: an author opens a PR in the Git repo (definitions, initiatives, assignments, exemptions). The CI/CD pipeline runs Build-DeploymentPlans to produce a plan artifact on PR, then — behind an approval gate — Deploy-PolicyPlan and Deploy-RolesPlan on merge, authenticating as a federated service principal that holds Resource Policy Contributor + User Access Administrator at the root MG. EPAC writes into the Azure control plane: custom definitions and initiatives at the management-group scope, assignments that carry a system-assigned managed identity for Modify/DINE, and exemptions down to the resource. Finally the target estate — subscriptions and resource groups under the MG hierarchy — is where enforcement bites: Deny intercepts new writes, Modify/DINE plus remediation tasks fix the existing fleet in throttled batches, and the compliance scan feeds results back to the control plane.
Follow the numbered badges to read the failure map. Badge ① on the pipeline marks drift — EPAC’s plan wants to delete a live object because someone changed it in the portal; the pacOwnerId boundary is what keeps that deletion safe. Badge ② on the assignment node marks the MSI replication lag that surfaces as PrincipalNotFound when Deploy-RolesPlan outruns Azure AD. Badge ③ on the DINE/Modify node marks a wrong existenceCondition that redeploys every scan and burns quota. Badge ④ on the remediation path marks the 429 storm from an unthrottled tenant-wide task. Badge ⑤ on the exemptions node marks exemption sprawl — un-expiring, untracked waivers that make compliance lie. Every path converges on the same proof: a clean Build-DeploymentPlans re-run reporting no changes means Git and Azure are in sync.
Real-world scenario
A platform team at Northwind Logistics runs ~600 subscriptions under a single root management group, governed by a small set of custom initiatives. The team is five engineers; the governance estate had grown organically in the portal and nobody could answer audit questions cleanly, so they moved it to EPAC with a plan/deploy pipeline and the standard sandbox → nonprod → prod rings.
The migration itself went smoothly. The trouble started with a new control: a Modify policy to enforce a cost-center tag, parameterized Audit → Deny per ring as usual. Sandbox and non-prod were clean — the Audit ring flagged ~3,100 untagged resources, which the team triaged and accepted as the remediation backlog. Then the production deploy failed every assignment with The policy assignment ... does not have the required role assignments — even though Deploy-RolesPlan had run successfully in the same pipeline.
The breakthrough came from asking what was different about scale. Modify and DINE identities are system-assigned, so the principal does not exist until the assignment is created. EPAC creates assignments in Deploy-PolicyPlan, then grants roles in Deploy-RolesPlan — but Azure AD replication of each new service principal lags by seconds to minutes. At 600 assignments, role creation outran replication: the roleAssignments PUT hit principals that were not yet visible tenant-wide, and Azure surfaced it as PrincipalNotFound wrapped in the generic “required role assignments” policy error. In sandbox with a dozen assignments, replication always finished first, which is why it never reproduced below production scale.
The fix was ordering plus idempotent retry, not more permissions. The team split the two deploys into separate pipeline stages with a deliberate gap, and let EPAC’s own retry reconcile the stragglers:
- stage: DeployRoles
  dependsOn: DeployPolicy
  jobs:
  - deployment: ApplyRoles
    environment: policy-prod
    strategy:
      runOnce:
        deploy:
          steps:
          - download: current
            artifact: policy-plan
          - pwsh: Start-Sleep -Seconds 120 # let AAD replicate new MSIs
          - task: AzureCLI@2
            inputs:
              azureSubscription: epac-spn
              scriptType: pscore
              scriptLocation: inlineScript
              inlineScript: |
                Deploy-RolesPlan -DefinitionsRootFolder ./policy `
                  -InputFolder $(Pipeline.Workspace)/policy-plan `
                  -PacEnvironmentSelector epac-prod
A re-run of Deploy-RolesPlan is a no-op for already-granted identities, so the second pass only cleans up what replication missed — without re-deploying a single policy object. With the gap in place, the prod deploy went green, and the team then remediated the ~3,100-resource tag backlog in per-landing-zone batches of 500 at --parallel-deployments 10, widening only after each batch landed clean. The whole estate reached compliant over a week with zero 429-induced half-states. The lesson on the wall: “At scale, the bug is rarely permissions — it’s that you raced a distributed system. Order the stages and let idempotency clean up the lag.”
The incident as a timeline, because the order of moves is the lesson:
| Time | Symptom | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| Day 1 | Sandbox/nonprod clean | Ship Audit, read count | 3,100 flagged; backlog known | Correct |
| Day 2, 10:00 | Prod fails every assignment | Re-run Deploy-RolesPlan | Same error | Ask: what changed at scale? |
| 10:30 | Still failing | Add more roles to the SPN | No change (already had them) | Not a permissions problem |
| 11:15 | Root cause found | Recognize MSI replication lag at 600 assignments | Two coupled facts: system-assigned MSI + AAD lag | — |
| 12:00 | Mitigated | Split stages + 120s gap + idempotent retry | Prod goes green | Correct fix |
| +1 week | Fully governed | Per-LZ remediation, 500 × 10 concurrency | 0 429 half-states, all compliant | The actual fix is batching |
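The "gap plus idempotent retry" fix generalizes to a small wrapper. The sketch below is not EPAC's built-in retry: `retry_with_gap` is a hypothetical helper that re-runs any command until it succeeds, and it is only safe because the wrapped step is a no-op for work already done.

```shell
#!/usr/bin/env bash
# Re-runs an idempotent command until it succeeds, waiting <delay> seconds
# between attempts. Arguments: <max-attempts> <delay-seconds> <command...>
retry_with_gap() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max )); then
      echo "giving up after ${attempt} attempt(s)" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; waiting ${delay}s for replication" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

A call such as `retry_with_gap 3 120 ./deploy-roles.sh` (where `deploy-roles.sh` is a placeholder for whatever runs the roles deployment) gives replication up to two extra windows before the stage fails. Wrapping a non-idempotent step this way would be a mistake, since every retry would repeat its side effects.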
Advantages and disadvantages
The git-driven, EPAC-reconciled model both fixes the unreviewability of portal governance and introduces its own operational edges. Weigh it honestly:
| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Every policy change is a PR with a reviewer, a diff, and a recorded why | Standing up the pipeline (EPAC, federation, rings) is real upfront effort |
| EPAC drift detection flags and removes orphaned assignments — the estate stays in sync with Git | Drift removal is dangerous if pacOwnerId/desiredState is misconfigured — it can delete live objects |
| One definition promotes Audit → Deny across rings by flipping a parameter, with no code change | Parameterizing everything adds indirection; a junior reader can’t see the effective effect at a glance |
| What-If + an Audit ring make the blast radius measurable before enforcement | The compliance scan lag (~24h) means feedback isn’t instant unless you trigger scans |
| Remediation tasks fix existing fleets at scale | Unthrottled remediation 429s and leaves half-fixed states; throttling is your responsibility |
| Exemptions in Git are time-bound, ticketed, and lifecycle-managed | A portal exemption created out-of-band becomes untracked drift the moment it’s made |
| Built-in initiatives compose with custom ones under one roll-up | Overlapping assignments can produce Conflicting compliance that’s confusing to resolve |
The model is right for any team at landing-zone scale that must prove governance — regulated industries, multi-subscription estates, anyone facing audits. It is overkill for a single subscription with three policies, where the portal is genuinely faster. The disadvantages are all manageable — a correct pacOwnerId, throttled remediation, exemptions-in-Git — but only if you know they exist, which is the point of this article.
Hands-on lab
Stand up a minimal policy-as-code loop without EPAC or a management group, so it runs free in any subscription: author a custom definition, assign it as Audit at a resource-group scope, trigger a scan, read compliance, then flip to Deny and watch it block. Run in Cloud Shell (Bash). Teardown at the end.
Step 1 — Variables and a sandbox resource group.
SUB=$(az account show --query id -o tsv)
RG=rg-policy-lab
LOC=centralindia
az group create -n $RG -l $LOC -o table
Step 2 — Author a custom definition (deny public blob access), parameterized effect.
cat > rule.json <<'JSON'
{
"if": { "allOf": [
{ "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
{ "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess", "notEquals": false }
]},
"then": { "effect": "[parameters('effect')]" }
}
JSON
cat > params.json <<'JSON'
{ "effect": { "type": "String", "defaultValue": "Audit",
"allowedValues": ["Audit","Deny","Disabled"] } }
JSON
az policy definition create \
--name "lab-deny-public-blob" \
--display-name "Lab: deny public blob access" \
--mode Indexed \
--rules @rule.json \
--params @params.json \
--subscription $SUB -o table
Expected: a definition row with policyType = Custom.
Step 3 — Assign it as Audit at the resource-group scope.
az policy assignment create \
--name "lab-audit-public-blob" \
--policy "lab-deny-public-blob" \
--scope "/subscriptions/$SUB/resourceGroups/$RG" \
--params '{ "effect": { "value": "Audit" } }' -o table
Step 4 — Create a deliberately non-compliant storage account, then scan.
SA=stpolicylab$RANDOM
az storage account create -n $SA -g $RG -l $LOC --sku Standard_LRS \
--allow-blob-public-access true -o table # intentionally non-compliant
az policy state trigger-scan --resource-group $RG # ~1-2 min
az policy state summarize --resource-group $RG \
--query "policyAssignments[].{name:policyAssignmentId, nonCompliant:results.nonCompliantResources}" \
-o table
Expected after the scan: nonCompliant: 1 — the storage account is flagged but not blocked, because the effect is Audit. That count is your blast radius.
Step 5 — Flip the assignment to Deny and prove it blocks.
az policy assignment update \
--name "lab-audit-public-blob" \
--scope "/subscriptions/$SUB/resourceGroups/$RG" \
--params '{ "effect": { "value": "Deny" } }' -o table
# Now try to create another public storage account — it should be REJECTED
az storage account create -n stpolicylab$RANDOM -g $RG -l $LOC --sku Standard_LRS \
--allow-blob-public-access true 2>&1 | grep -i "disallowed\|RequestDisallowedByPolicy" \
&& echo "Policy denial seen above: Deny is working." \
|| echo "No policy denial seen; the assignment change may not have taken effect yet."
Expected: the create fails with RequestDisallowedByPolicy naming the assignment. Note that the existing public account from Step 4 is still there — Deny is forward-only, which is exactly why a fleet needs Modify/DINE + remediation.
Step 6 — (Optional) Add a time-bound exemption for the legacy account.
az policy exemption create \
--name "lab-waiver" \
--policy-assignment "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Authorization/policyAssignments/lab-audit-public-blob" \
--exemption-category Waiver \
--scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA" \
--expires-on "2026-12-31T23:59:59Z" \
--description "LAB-001: demo waiver; owner: you" -o table
Step 7 — Teardown (delete everything so there’s no spend or lingering policy).
az policy exemption delete --name "lab-waiver" \
--scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA" 2>/dev/null
az policy assignment delete --name "lab-audit-public-blob" \
--scope "/subscriptions/$SUB/resourceGroups/$RG"
az policy definition delete --name "lab-deny-public-blob" --subscription $SUB
az group delete -n $RG --yes --no-wait
The lab steps and what each proves:
| Step | What you did | What it proves |
|---|---|---|
| 2 | Authored a parameterized definition | Effect is a knob, not a constant |
| 3 | Assigned as Audit | Same definition, scope chosen at assignment |
| 4 | Created a bad resource + scanned | Audit measures without blocking |
| 5 | Flipped to Deny | Enforcement blocks new violations only |
| 6 | Added a time-bound exemption | The documented, expiring exception |
| 7 | Deleted everything | Clean teardown; no lingering guardrails |
Common mistakes & troubleshooting
The differentiator. Each failure mode below is symptom → root cause → how to confirm (exact command) → fix. First the playbook table you scan mid-rollout, then the detail on the ones that need it.
| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | EPAC plan wants to Delete a live assignment | Drift: object changed/created in the portal, not in Git | Build-DeploymentPlans → read the plan’s deletes | Re-import to Git, or confirm pacOwnerId ownership before allowing the delete |
| 2 | Deny blocked 4,000 deploys on day one | Skipped the Audit ring | az policy state summarize shows the count you never read | Roll back to Audit; promote after triage |
| 3 | “does not have the required role assignments” | DINE/Modify MSI replication lag (or missing role) | az policy assignment show --query identity; az role assignment list --assignee <principalId> | Split deploy/roles stages + gap; re-run Deploy-RolesPlan |
| 4 | DINE redeploys its template every scan | Wrong existenceCondition that never matches an already-compliant resource | az policy state list shows perpetual non-compliance on compliant resources | Fix the existenceCondition; test against a known-good resource |
| 5 | Remediation Failed halfway, fleet half-fixed | 429 Too Many Requests from an unthrottled tenant-wide task | Remediation task → failure count; activity log 429s | Lower --parallel-deployments/--resource-count; scope per LZ; re-run (idempotent) |
| 6 | Identity-bearing assignment fails to deploy | Missing location on a Modify/DINE assignment | Deploy error names a missing region | Add location: to the assignment |
| 7 | Compliance dashboard empty / NotStarted | No scan has run since deploy | az policy state summarize shows NotStarted | az policy state trigger-scan; wait the scan window |
| 8 | Policy “does nothing” on a property | No alias exists, or field typo’d | az provider show --expand resourceTypes/aliases lacks the alias | Use an existing alias, or pick a different enforcement point |
| 9 | Two assignments fight; compliance Conflicting | Overlapping assignments with different effects | az policy assignment list --scope ... shows both | Consolidate; one initiative per control |
| 10 | Exemption “covers” a resource but it’s still flagged | Wrong policy-definition-reference-ids or scope | az policy exemption show vs the failing definition ref | Match the exact reference ID + narrowest scope |
| 11 | Effect changed in Git but Azure still old | Plan not re-run, or wrong selector deployed | Build-DeploymentPlans diff shows the change as pending | Re-run plan/deploy with the correct PacEnvironmentSelector |
| 12 | Modify doesn’t change existing resources | Modify is write-time; existing fleet needs remediation | az policy state list shows them still non-compliant | Create a remediation task for the Modify assignment |
Drift wants to delete a live object (#1)
EPAC’s whole value is reconciliation, which means its plan will propose deleting anything stamped with your pacOwnerId that isn’t in Git. Confirm: read the Delete entries in the Build-DeploymentPlans output and check whether the object carries your owner stamp. Fix: if it’s a legitimate object someone created in the portal, import it back into the repo so Git becomes the source of truth; if it genuinely should go, let the plan remove it. Never set desiredState.strategy to full unless you have deliberately decided EPAC owns every policy object under the scope — full will delete unowned objects too.
“Required role assignments” at scale (#3)
The production-scale classic from the scenario. Confirm the identity exists and was granted its roles:
PRINCIPAL=$(az policy assignment show --name deploy-diag-to-law \
--scope /providers/Microsoft.Management/managementGroups/contoso \
--query identity.principalId -o tsv)
az role assignment list --assignee "$PRINCIPAL" -o table # empty during replication lag
Fix: split Deploy-PolicyPlan and Deploy-RolesPlan into separate stages with a deliberate gap so Azure AD finishes replicating the new system-assigned principals, then let EPAC’s idempotent retry reconcile any stragglers. A re-run of Deploy-RolesPlan is a no-op for already-granted identities.
A wrong DINE existenceCondition (#4)
DINE evaluates an existenceCondition to decide whether the companion resource already exists. If that condition can never match an already-compliant resource, the engine concludes the resource is missing on every scan and redeploys the template forever — noisy and expensive. Confirm: a resource you know has diagnostics configured still shows NonCompliant. Fix: test the existenceCondition against a known-good resource and confirm it reports compliant before you assign at scale.
# Are resources you believe are compliant still flagged non-compliant? (smell test)
az policy state list --resource-group rg-known-good \
--query "[?complianceState=='NonCompliant'].{res:resourceId, policy:policyDefinitionName}" -o table
The 429 remediation storm (#5)
A remediation task fans out one deployment per non-compliant resource. Confirm: the task state is Failed/Complete with failures, and the activity log shows 429 Too Many Requests:
az monitor activity-log list --offset 1h \
--query "[?contains(to_string(httpRequest), '429') || status.value=='Failed'].{op:operationName.value, status:status.value, time:eventTimestamp}" \
-o table
Fix: lower --parallel-deployments and --resource-count, scope the task to one landing zone, and re-run — remediation is idempotent, so the re-run only fixes what’s still non-compliant. Widen concurrency only after a batch lands clean.
Best practices
Crisp, production-grade rules — most of these are the difference between a governed estate and a pile of orphaned assignments.
- Keep the four object types in separate Git trees, named by capability, never by scope. A scope-named definition can never be promoted or reused.
- Always assign initiatives, not bare definitions — even an initiative of one. Retrofitting an initiative later means re-creating assignments and losing compliance history.
- Parameterize the effect so the same definition ships Audit then Deny per ring. The effect should live on the assignment, not in the definition.
- Give EPAC a unique pacOwnerId per repo and leave desiredState.strategy at the safe ownedOnly unless you have deliberately decided to own everything.
- Split the pipeline into plan (PR) and deploy (merge) with an approval gate, and split Deploy-PolicyPlan from Deploy-RolesPlan with a deliberate gap at scale.
- Never enforce Deny without first reading the Audit count in a sandbox MG. The audit count is your blast radius.
- Pair every Deny with a remediation plan (Modify/DINE + remediation task) for anything that already exists; Deny is forward-only.
- Throttle remediation and scope it per landing zone. Run small, watch the failure column, widen only on clean batches.
- Make every exemption time-bound, ticket-referenced, and version-controlled. Commit them so EPAC removes them the instant they leave Git.
- Grant policy permissions to a federated service principal, never a human, and gate it behind PR review.
- Test a DINE existenceCondition against a known-good resource before assigning at scale, or it redeploys forever.
- Wire compliance roll-up to a dashboard and run a nightly Build-DeploymentPlans as scheduled drift detection; a no-op plan is your proof of parity.
Security notes
Policy-as-code is a security control, and its own attack surface is the deploying identity and the drift boundary.
- Least privilege for the pipeline principal. It needs exactly Resource Policy Contributor (policy objects) and User Access Administrator (role assignments for DINE/Modify MSIs) at the root MG — nothing broader. UAA is powerful (it can grant any role), so scope it to the policy root MG and gate every change behind PR review. Treat this principal the way you’d treat any high-privilege identity in Entra RBAC governance.
- Federated, not secret-based, auth. Use workload identity federation (OIDC) for the pipeline so there is no client secret to leak — the same secretless pattern as GitHub Actions + Terraform OIDC.
- The pacOwnerId is a safety boundary, not a secret. Its job is to stop EPAC deleting objects it doesn’t own; misconfiguring it (or setting desiredState to full) is the most dangerous mistake in the whole pipeline because it can remove live guardrails.
- DINE/Modify identities are themselves privileged. A DINE policy that deploys diagnostics gets Monitoring Contributor; one that configures backup gets backup roles. Enumerate exactly what each assignment’s roleDefinitionIds grant and scope them to the assignment’s MG, because an over-granted DINE identity is a lateral-movement path.
- Exemptions are security exceptions, so treat them as such. An un-expiring Waiver is an open hole. Require a ticket, an owner, an expiry, and the narrowest scope; review the live exemption list as part of every audit.
- Don’t leak resource internals in displayName/description. Policy metadata is broadly readable; keep secrets and sensitive identifiers out of it.
- Protect the most destructive resources with DenyAction on delete where appropriate, and never use a blanket Disabled to silence a control; an exemption is the auditable equivalent.
Cost & sizing
Azure Policy itself is free — there is no charge for definitions, assignments, evaluations, or compliance scans. The bill comes from what your policies cause to be deployed and from how you remediate. Get these wrong and a governance pipeline quietly runs up a Log Analytics and ARM bill.
| Cost driver | What it is | Rough magnitude | How to control it |
|---|---|---|---|
| Azure Policy service | Definitions, assignments, scans | Free | N/A — never the cost |
| DINE-deployed resources | Diagnostics → Log Analytics ingestion | Per-GB ingested; can dominate at fleet scale | Scope diagnostics; sample; tier the workspace |
| Wrong existenceCondition redeploys | Template redeployed every scan | Wasted ARM ops + any resource cost | Fix the condition (mistake #4) |
| Remediation deployments | One deployment per resource | Compute/time, not a per-deploy fee | Batch + throttle; one-time backlog |
| Log Analytics for compliance | Storing compliance/activity logs | Per-GB + retention | Right-size retention; archive tier |
| Pipeline agent minutes | CI/CD running EPAC | Cheap (minutes per run) | Run plan on PR, deploy on merge only |
The DINE-to-Log-Analytics path is where real money hides: a deployIfNotExists that turns on every diagnostic category for every resource across 600 subscriptions can ingest enormous volumes. Size it deliberately — pick the categories you actually query, consider sampling, and route to a workspace tiered for the volume, the same discipline as in Azure Monitor & Application Insights observability. In INR terms, the policy pipeline’s own footprint is negligible (pipeline minutes, a few rupees a run); the variable cost is entirely the ingestion and retention your DINE policies generate, which can run from near-zero on a small estate to lakhs per month if you onboard full diagnostics fleet-wide without sampling.
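The sizing arithmetic behind that paragraph is simple enough to run before a DINE diagnostics policy goes fleet-wide. The sketch below is illustrative only: the helper names are invented, and the resource count, per-resource daily volume, and per-GB price are placeholder inputs, not Azure prices or measurements. Substitute the figures from your own workspace and price sheet (in whatever currency you bill in).

```shell
#!/usr/bin/env bash
# Sketch: back-of-envelope ingestion estimate for a DINE diagnostics rollout.
# All inputs are placeholders; none of them is a real Azure price or measurement.
set -euo pipefail

# Monthly ingestion in GB.
# $1 = number of resources, $2 = average MB ingested per resource per day.
estimate_monthly_gb() {
  awk -v resources="$1" -v mb_per_day="$2" \
    'BEGIN { printf "%.1f\n", resources * mb_per_day * 30 / 1000 }'
}

# Monthly cost in your billing currency.
# $1 = monthly GB, $2 = price per GB (taken from your own price sheet).
estimate_monthly_cost() {
  awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

# Placeholder scenario: 10,000 resources at 50 MB/day each, 2.50 per GB.
gb="$(estimate_monthly_gb 10000 50)"
echo "Estimated ingestion: ${gb} GB/month"
echo "Estimated cost: $(estimate_monthly_cost "$gb" 2.50) per month"
```

Re-running the estimate with only the diagnostic categories you actually query shows how much of the bill is decided by category selection rather than by fleet size.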
Sizing the rollout itself — how long and how risky each phase is:
| Phase | Effort / duration | Cost | Risk if rushed |
|---|---|---|---|
| Stand up EPAC + pipeline | Days (one-time) | Negligible | Misconfigured pacOwnerId |
| Author + lint definitions | Hours per control | Free | Bad alias / over-broad rule |
| Audit ring + read blast radius | Minutes per control (+ scan window) | Free | Skipping it → day-one Deny outage |
| Promote to Deny/DINE | Per ring, gated | Free (policy) | Enforcement breaks teams |
| Remediate existing fleet | Days (throttled batches) | ARM time + downstream ingestion | 429 half-states; ingestion blowout |
Interview & exam questions
Mapped to AZ-104 (governance), AZ-305 (design governance), and the AZ-500/SC-100 security-design angle. Which exam emphasises which slice of this topic:
| Exam | What it tests on policy-as-code | The questions below that map |
|---|---|---|
| AZ-104 | Create/assign policy + initiatives; remediation basics; exemptions | Q1, Q2, Q3, Q11 |
| AZ-305 | Design governance: MG hierarchy, ring promotion, effect choice at scale | Q1, Q7, Q9, Q12 |
| AZ-500 | Security guardrails, least-privilege deploy identity, DenyAction | Q8, Q11 |
| SC-100 | Governance strategy, exemption discipline, audit posture | Q6, Q11, Q12 |
| AZ-400 | The CI/CD pipeline, gates, idempotent deploy, drift detection | Q6, Q9, Q10, Q12 |
Q1. What are the four Azure Policy object types and how do they differ? Definition (a single if/then rule), initiative/policy set (a bundle of definitions with hoisted parameters), assignment (a definition or initiative bound to a scope with parameter values), and exemption (a time-bound, audited waiver). Definitions/initiatives describe capability; assignments describe where it’s enforced; exemptions describe the documented exception.
Q2. When does a policy evaluate, and which effects can’t run inline? On resource create/update (intercepted before commit) and on a ~24-hour background compliance scan. auditIfNotExists and deployIfNotExists never run inline with the request: they evaluate only after a write has succeeded and on the scan, because they must inspect related resources that already exist.
Q3. Why can’t Deny fix an existing non-compliant fleet, and what do you use instead? Deny only blocks new non-compliant writes; it never touches resources that already exist. To fix the existing fleet you use Modify or DeployIfNotExists plus a remediation task, which fans out one deployment per non-compliant resource.
Q4. What’s the difference between field and value in a policy rule? field reads an alias-mapped property of the target resource and can drive deny/modify into the request payload; value evaluates an arbitrary expression (a parameter, a template function) independent of the resource and cannot reach into the payload.
Q5. Why must you enumerate aliases before writing a rule? A policy can only target a property that has an alias exposed by the resource provider. If no alias exists for the property, you cannot write a field condition against it — so you list aliases (az provider show --expand resourceTypes/aliases) first and pick a different enforcement point if needed.
Q6. What does pacOwnerId do in EPAC and why is it a safety mechanism? It stamps every object EPAC creates so the tool only manages and deletes objects bearing that ID. This makes drift removal safe — EPAC never deletes another team’s assignments or Microsoft’s built-in initiatives.
Q7. Why parameterize the effect on the assignment instead of hard-coding it in the definition? So the same definition can ship Audit in sandbox and Deny/DINE in prod by changing only the assignment’s parameter and scope — promotion through rings becomes a parameter flip, not a code change.
Q8. What permissions does the policy-deploying principal need, and why two? Resource Policy Contributor at the root MG to create policy objects, plus User Access Administrator (or Owner) to create the role assignments that DINE/Modify managed identities require. Grant both to a federated service principal, never a human.
Q9. Why do Modify/DINE assignments fail with “required role assignments” at large scale, and how do you fix it? Their identities are system-assigned, so the principal doesn’t exist until the assignment is created; at hundreds of assignments, role creation can outrun Azure AD replication and hit PrincipalNotFound. Fix by splitting deploy and role stages with a gap and relying on EPAC’s idempotent retry.
Q10. How do you remediate thousands of resources without a 429 storm? Throttle with --parallel-deployments and --resource-count, scope tasks per landing zone rather than tenant-wide, watch the failure column, and widen concurrency only after a batch lands clean. Remediation is idempotent, so re-runs only fix what’s still non-compliant.
Q11. What’s the difference between disabling a policy and exempting a resource? Disabling a policy (for example, setting its effect to Disabled) silently removes the control for everyone with no scope, expiry, or trail. An exemption is scoped to specific resources/rules, carries a category, expiry, and description, is audited, and lapses on its expiry date, which makes it the reviewable equivalent.
Q12. How do you prove that Git and Azure are actually in sync? Re-run Build-DeploymentPlans after a deploy; a plan reporting no changes is the proof of parity. A nightly plan run is your scheduled drift detection.
Quick check
- Which of the four object types is the only one that carries a scope and parameter values?
- You ship a new tag-enforcement control. What effect do you use first, and what does its result number tell you?
- A DeployIfNotExists policy redeploys its template on every compliance scan. What is almost certainly wrong?
- Your production policy deploy fails with “does not have the required role assignments” at 600 assignments, even though Deploy-RolesPlan ran. What’s the cause?
- A remediation task ends in Failed with the fleet half-fixed. What single setting do you change first, and what makes the re-run safe?
Answers
- The assignment — definitions and initiatives describe capability; the assignment binds them to a scope with concrete parameter values.
- Audit. The non-compliant count is your blast radius: flipping straight to Deny would have broken exactly that many deployments.
- The existenceCondition is wrong. It never matches an already-compliant resource, so the engine thinks the companion is missing every scan. Test it against a known-good resource.
- Managed-identity replication lag: system-assigned identities don’t exist until the assignment is created, and at scale role creation outruns Azure AD replication (PrincipalNotFound). Split the deploy/roles stages with a gap and let EPAC’s idempotent retry reconcile.
- Lower --parallel-deployments (and/or --resource-count) and scope per landing zone; the re-run is safe because remediation is idempotent, so it only fixes what’s still non-compliant.
Glossary
- Policy definition: A single governance rule (if/then) authored once and assignable to any scope.
- Initiative (policy set): A bundle of definitions with hoisted parameters, giving one compliance roll-up per business control.
- Assignment: A definition or initiative bound to a scope (MG/subscription/RG) with concrete parameter values; the only object that knows where enforcement happens.
- Exemption: A time-bound, audited waiver of an assignment, scoped down to a resource; either Waiver or Mitigated.
- Effect: What a policy does when its rule matches: Audit, Deny, Modify, Append, DeployIfNotExists, AuditIfNotExists, Manual, Disabled, DenyAction.
- Alias: A mapping from a policy field to a resource provider’s real property path; without one, you can’t write policy against that property.
- Mode: Which resource types a definition evaluates: Indexed (taggable/locatable resources) or All (plus RGs/subscriptions), with data-plane modes for AKS/Key Vault.
- field vs value: field reads the target resource’s (alias-mapped) property and can drive payload mutation; value evaluates an arbitrary expression independent of the resource.
- EPAC (Enterprise Policy as Code): A PowerShell module that reads a policy repo, builds a plan, and applies it idempotently with drift detection.
- pacOwnerId: The GUID EPAC stamps on objects it manages, so drift removal only ever touches objects it owns.
- existenceCondition: The DINE/AINE check for whether a related resource already exists; a wrong one causes perpetual redeploys.
- Remediation task: A bulk operation that fans out one template deployment per non-compliant resource to fix an existing fleet.
- Compliance scan: The ~24-hour background re-evaluation of existing resources that refreshes audit/DINE/AINE state.
- What-If: A preview of what a deployment would create, modify, or delete, without changing anything; the cheapest pre-merge validation.
- Management group (MG): A container above subscriptions through which policy and RBAC inherit downward; the usual policy deployment root scope.
Next steps
- Azure Policy governance at scale — the conceptual model this pipeline automates, including initiative design and compliance roll-ups.
- Enterprise-scale management-group hierarchy design — how to shape the MG rings this pipeline promotes through.
- Bicep deployment stacks, What-If & CI — the validation and deployment mechanics that pair with the policy pipeline.
- Azure DevOps YAML multistage environments & approvals — the gate-and-promotion patterns for the plan/deploy split.
- Gatekeeper / OPA policy as code for admission control: the in-cluster equivalent for AKS, which Azure Policy’s Microsoft.Kubernetes.Data mode integrates with.