
FinOps on Azure: From Cost Visibility to Engineered Savings

Cloud cost is an engineering output, not a finance report you receive after the fact. FinOps is the discipline that gives the people who create spend the data and levers to own it. This playbook treats Azure cost the way you’d treat latency or error rate: instrument it, attribute it, set targets, and automate the corrective action.

The FinOps operating model, mapped to Azure

The FinOps Foundation describes three iterative phases. Each maps cleanly to native Azure tooling:

| Phase | Goal | Azure tooling |
| --- | --- | --- |
| Inform | Visibility, allocation, showback | Cost Management, tags, exports, Cost Analysis |
| Optimize | Rightsizing, commitments | Advisor, Reservations, Savings Plans, spot |
| Operate | Continuous governance | Azure Policy, Budgets, Automation, action groups |

The mistake teams make is living permanently in Inform — buying a dashboard and calling it FinOps. The dashboard is table stakes. The value is in closing the loop into Optimize and Operate, repeatedly.

Step 1 — A tag taxonomy that survives

Allocation is impossible without consistent metadata. Decide on a small, mandatory set of tag keys and treat anything else as optional. Keys are case-insensitive for lookup but case-preserving in the portal, so pick one canonical casing and enforce it.

| Tag key | Example | Used for |
| --- | --- | --- |
| CostCenter | CC-4412 | Chargeback to a finance code |
| Owner | team-payments | Routing alerts and cleanup |
| Environment | prod / dev / test | Splitting prod vs. non-prod spend |
| Application | checkout-api | Unit economics per service |
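A quick way to audit a resource against this taxonomy is a small validator. The sketch below is illustrative Python, not an Azure API: `REQUIRED_TAGS` and `check_tags` are hypothetical names, and it assumes the tags arrive as a plain dictionary (for example, parsed from CLI JSON output). It matches keys case-insensitively and reports any key whose casing differs from the canonical form.

```python
# Canonical tag keys from the taxonomy table (hypothetical constant name).
REQUIRED_TAGS = ("CostCenter", "Owner", "Environment", "Application")


def check_tags(tags: dict[str, str]) -> dict[str, list[str]]:
    """Return required keys that are missing and keys present with non-canonical casing."""
    by_lower = {key.lower(): key for key in tags}
    missing, miscased = [], []
    for required in REQUIRED_TAGS:
        actual = by_lower.get(required.lower())
        if actual is None:
            missing.append(required)
        elif actual != required:
            miscased.append(actual)
    return {"missing": missing, "miscased": miscased}


report = check_tags({"costcenter": "CC-4412", "Owner": "team-payments"})
print(report)
```

Running a check like this over an inventory export gives a cleanup list before the deny policy starts rejecting deployments.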

Two things make tags actually stick:

Inheritance. Resources do not inherit tags from their resource group automatically. Use the built-in modify-effect policies to inherit a tag from the resource group when it is missing on the resource. The well-known definition IDs are stable:

# Inherit "CostCenter" from the resource group when absent on the resource
# (built-in definition "Inherit a tag from the resource group if missing")
MG_SCOPE="/providers/Microsoft.Management/managementGroups/landing-zones"
az policy assignment create \
  --name "inherit-costcenter" \
  --scope "$MG_SCOPE" \
  --policy "ea3f2387-9b95-492a-a190-fcdc54f7b070" \
  --params '{"tagName":{"value":"CostCenter"}}' \
  --mi-system-assigned --location eastus \
  --identity-scope "$MG_SCOPE" \
  --role "Contributor"

The modify effect needs a managed identity with rights to write tags, hence --mi-system-assigned together with --identity-scope and --role, which create the role assignment for that identity. Then run a remediation task so the policy fixes existing resources, not just new ones.

Enforcement. For the keys you cannot live without, deny the create:

# Require the "Owner" tag on every new resource
az policy assignment create \
  --name "require-owner-tag" \
  --scope "/providers/Microsoft.Management/managementGroups/landing-zones" \
  --policy "871b6d14-10aa-478d-b590-94f262ecfa99" \
  --params '{"tagName":{"value":"Owner"}}'

Assign these at the management-group scope so every current and future subscription inherits them on day one.

Step 2 — Cost allocation, showback, and exports

With tags flowing, allocation becomes a grouping operation. For ad-hoc analysis, group Cost Analysis by tag. For anything recurring, do not scrape the portal — push data out.

Scheduled exports write amortized cost (commitments spread across their term, the only honest view) to a storage account as daily CSV/Parquet:

# --timeframe is required; each run exports the month-to-date data set
az costmanagement export create \
  --name "daily-amortized" \
  --scope "/providers/Microsoft.Management/managementGroups/contoso" \
  --type AmortizedCost \
  --timeframe MonthToDate \
  --storage-account-id "$SA_ID" \
  --storage-container "cost-exports" \
  --storage-directory "amortized" \
  --recurrence Daily \
  --recurrence-period from="2026-06-01T00:00:00Z" to="2027-06-01T00:00:00Z"

For interactive queries, the Cost Management Query API aggregates server-side so you transfer summaries, not raw line items:

az rest --method post \
  --url "https://management.azure.com/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.CostManagement/query?api-version=2023-11-01" \
  --body '{
    "type": "AmortizedCost",
    "timeframe": "MonthToDate",
    "dataset": {
      "granularity": "None",
      "aggregation": { "totalCost": { "name": "CostUSD", "function": "Sum" } },
      "grouping": [
        { "type": "TagKey", "name": "Application" },
        { "type": "Dimension", "name": "ResourceGroupName" }
      ]
    }
  }'

Use AmortizedCost, not ActualCost, for showback. ActualCost dumps the entire upfront Reservation charge on the purchase day, which makes a team look like it tripled its spend for one month and then went to zero. Amortized spreads it across the term so trends are real.
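To make the difference concrete, here is a minimal sketch of the two views for a single all-upfront commitment. The figures are hypothetical, `monthly_views` is an illustrative helper rather than anything Cost Management exposes, and real amortization is computed at a finer grain than whole months.

```python
def monthly_views(upfront: float, term_months: int) -> tuple[list[float], list[float]]:
    """Return (actual, amortized) monthly cost series for an all-upfront commitment."""
    # Actual cost: the whole charge lands in the purchase month.
    actual = [upfront] + [0.0] * (term_months - 1)
    # Amortized cost: the same total, spread evenly across the term.
    amortized = [upfront / term_months] * term_months
    return actual, amortized


actual, amortized = monthly_views(upfront=12_000.0, term_months=12)
print("actual   :", actual[:3])
print("amortized:", amortized[:3])
```

Both series sum to the same total; only the amortized one produces a month-over-month trend a team can act on.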

Shared-cost splitting. Some spend (hub firewall, Log Analytics ingestion, shared AKS system pools) has no single owner. Don’t leave it in an “unallocated” bucket: define a cost allocation rule that distributes shared resource groups across consumers, either in proportion to each consumer's own cost, as an even split, or by a custom percentage. Configure these under Cost Management > Cost allocation; the split then appears natively in Cost Analysis and the Query API so showback reconciles to 100%.
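The proportional variant is simple arithmetic. The sketch below illustrates the idea under stated assumptions: `split_shared_cost` is a hypothetical helper, the team names and amounts are made up, and it falls back to an even split when no consumer has direct spend to weight by.

```python
def split_shared_cost(shared: float, direct_costs: dict[str, float]) -> dict[str, float]:
    """Distribute a shared cost across consumers in proportion to their direct cost."""
    if not direct_costs:
        raise ValueError("at least one consumer is required")
    total = sum(direct_costs.values())
    if total == 0:
        # Nothing to weight by: fall back to an even split.
        even = shared / len(direct_costs)
        return {name: even for name in direct_costs}
    return {name: shared * cost / total for name, cost in direct_costs.items()}


allocation = split_shared_cost(
    shared=9_000.0,
    direct_costs={"payments": 30_000.0, "search": 15_000.0, "internal-tools": 15_000.0},
)
print(allocation)
```

The allocated amounts always sum back to the shared total, which is what lets showback reconcile.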

Step 3 — Budgets, anomaly detection, and alert routing

A budget in Azure is not a hard cap; it is a tripwire. The point is to route the signal to the owner, not a central inbox nobody reads.

Wire an action group (email, webhook, Logic App, or Azure Function) and attach a programmatic budget with multiple thresholds, including a forecasted trigger that fires before you actually overspend:

AG_ID=$(az monitor action-group create \
  --name "ag-finops-payments" \
  --resource-group rg-finops \
  --short-name "finops" \
  --action email payments-lead lead@contoso.com \
  --query id -o tsv)

az consumption budget create \
  --budget-name "payments-monthly" \
  --amount 25000 \
  --category Cost \
  --time-grain Monthly \
  --start-date 2026-06-01 \
  --end-date 2027-06-01 \
  --resource-group rg-payments \
  --notifications '{
    "actual80":   {"enabled":true,"operator":"GreaterThan","threshold":80,
                   "contactGroups":["'"$AG_ID"'"],"thresholdType":"Actual"},
    "forecast100":{"enabled":true,"operator":"GreaterThan","threshold":100,
                   "contactGroups":["'"$AG_ID"'"],"thresholdType":"Forecasted"}
  }'

For spikes that a static threshold misses, turn on cost anomaly detection in Cost Management. It models a subscription’s normal daily pattern and flags statistically significant deviations; subscribe an anomaly alert so the owner hears about a runaway batch job in hours, not at month-end invoice.
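As an intuition for what "statistically significant deviation" means, the sketch below flags a day whose cost sits several standard deviations away from recent history. This is a deliberately simple z-score stand-in, not the model Cost Management uses; `is_anomalous`, the threshold of 3, and the sample figures are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's cost if it is more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


baseline = [410.0, 395.0, 402.0, 420.0, 398.0, 405.0, 412.0]  # hypothetical daily USD
print(is_anomalous(baseline, 415.0))
print(is_anomalous(baseline, 1_250.0))
```

A static budget threshold would only notice the second case once the month's total crossed the line; a deviation check catches it on the day it happens.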

Step 4 — Commitment strategy and the break-even math

Pay-as-you-go is the most expensive way to run a steady workload. Three commitment vehicles, each with a different trade-off:

| Vehicle | Flexibility | Best for | Typical discount |
| --- | --- | --- | --- |
| Reservation | Locked to a VM family / resource type & region | Stable, predictable footprint | Highest |
| Savings Plan (compute) | Any compute, any region, hourly $ commitment | Mixed, shifting fleets | High, below RI |
| Spot | Evictable, no SLA | Fault-tolerant, interruptible batch | Up to ~90% |

The decision rule is utilization risk. A Reservation gives the deepest discount but only pays off if you actually run that family for the whole term. A Savings Plan trades a few points of discount for the freedom to move between VM families, regions, and even services (Functions Premium, Container Instances) as long as you keep spending the committed hourly rate.

Break-even. A commitment is worth it when the discounted run-rate over the term beats keeping the resource on-demand for the fraction of the term you’ll actually use it. For a 1-year Reservation at a 40% discount:

break-even utilization = commitment cost over term / on-demand cost over term
                       = 1 - discount

At a 40% discount the RI price is 0.60 x on-demand, so you break even once
the resource runs 60% of the year. Above that, every extra hour is pure savings.
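The same arithmetic as a small calculator, useful for testing a proposed commitment against expected utilization. This is a sketch under simple assumptions: a flat discount, a one-year term of 8,760 hours, and a hypothetical on-demand rate; `break_even_utilization` and `net_savings` are illustrative names.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(discount: float) -> float:
    """Fraction of the term a resource must run for the commitment to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def net_savings(on_demand_hourly: float, discount: float, utilization: float,
                term_hours: int = HOURS_PER_YEAR) -> float:
    """On-demand cost of the hours actually used, minus the fixed commitment cost."""
    commitment_cost = on_demand_hourly * (1.0 - discount) * term_hours
    on_demand_cost = on_demand_hourly * utilization * term_hours
    return on_demand_cost - commitment_cost


rate = 0.20  # hypothetical on-demand price, USD per hour
print(break_even_utilization(0.40))
for utilization in (0.50, 0.60, 0.90):
    print(utilization, round(net_savings(rate, 0.40, utilization), 2))
```

A negative result means the commitment costs more than staying on-demand at that utilization.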

In practice: commit to your floor (the capacity you are certain to run 24/7) with Reservations, cover the variable middle with a Savings Plan, and burst the interruptible top on Spot. Buy with a coverage target, e.g. 75–85% of eligible compute committed, deliberately leaving headroom so you are never paying for committed capacity you can’t fill. Reservations also support instance size flexibility within a family, so a commitment to one size auto-applies across sizes in the same group at the right ratio.
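Coverage and utilization pull in opposite directions, so it helps to compute both for a candidate commitment level. The sketch below assumes a flat hourly commitment measured in the same unit as demand; the demand series and the `coverage_and_utilization` helper are hypothetical.

```python
def coverage_and_utilization(hourly_demand: list[float], committed: float) -> tuple[float, float]:
    """Return (coverage, utilization) for a flat hourly commitment.

    coverage    = share of total demand served by the commitment
    utilization = share of the commitment that was actually used
    """
    if committed <= 0 or not hourly_demand:
        raise ValueError("need a positive commitment and at least one hour of demand")
    covered = sum(min(demand, committed) for demand in hourly_demand)
    coverage = covered / sum(hourly_demand)
    utilization = covered / (committed * len(hourly_demand))
    return coverage, utilization


demand = [100.0, 100.0, 120.0, 160.0, 200.0, 120.0]  # hypothetical vCPUs in use per hour
coverage, utilization = coverage_and_utilization(demand, committed=120.0)
print(f"coverage={coverage:.0%} utilization={utilization:.0%}")
```

Raising the commitment pushes coverage up and utilization down; the coverage target is the point where you stop before utilization starts to erode.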

Step 5 — Rightsizing with Advisor and metrics

Azure Advisor continuously analyzes utilization and emits cost recommendations. Pull them programmatically and feed them into your review:

az advisor recommendation list \
  --category Cost \
  --query "[].{resource:impactedValue, problem:shortDescription.problem, savings:extendedProperties.savingsAmount}" \
  -o table

Advisor is a starting point, not gospel: its VM downsizing logic is CPU/network-weighted and can miss memory-bound workloads. Validate against actual metrics before resizing. Azure Monitor offers no percentile aggregation for platform metrics, so pull hourly average and maximum CPU for the last fortnight and compute P95 from the samples:

az monitor metrics list \
  --resource "$VM_ID" \
  --metric "Percentage CPU" \
  --interval PT1H \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-15T00:00:00Z \
  --aggregation Average Maximum \
  -o table
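Once the hourly samples are exported, the percentile itself is a few lines. The sketch below uses the nearest-rank method on a hypothetical list of hourly CPU averages; `percentile` is an illustrative helper, and other percentile definitions (for example, linear interpolation) give slightly different values.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical hourly average CPU (%), including one short spike.
hourly_cpu = [22.0, 25.0, 19.0, 31.0, 28.0, 24.0, 27.0, 35.0, 30.0, 26.0,
              23.0, 29.0, 33.0, 21.0, 20.0, 32.0, 34.0, 36.0, 38.0, 97.0]
p95 = percentile(hourly_cpu, 95)
print("P95:", p95, "max:", max(hourly_cpu))
```

Sizing on P95 rather than the maximum keeps a single spike from blocking an otherwise safe downsize; whether that spike matters is a judgment call for the workload owner.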

Beyond VM SKUs, the cheap, high-confidence wins are orphaned resources that bill while doing nothing:

# Unattached managed disks (no managedBy) still cost full provisioned GB
az disk list --query "[?managedBy==null].{name:name, rg:resourceGroup, gb:diskSizeGb, sku:sku.name}" -o table

# Public IPs not associated with any NIC or load balancer (Standard SKU bills hourly)
az network public-ip list --query "[?ipConfiguration==null && natGateway==null].{name:name, rg:resourceGroup, sku:sku.name}" -o table

Add to the list: idle load balancers, empty App Service Plans, premium snapshots no one tracks, and dev/test workloads left running on production-tier SKUs. Each is a recurring charge against zero value.

Step 6 — Automating waste cleanup

Recommendations that require a human to act on them decay. Encode the easy decisions.

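Auto-shutdown for non-prod. Dev and test VMs rarely need to run nights and weekends. The built-in auto-shutdown schedule is one call (--time is UTC, in hhmm format). It only stops the VM, so something still has to start it each morning; a VM that runs about 12 hours on weekdays instead of 24/7 cuts its compute bill by roughly 65%: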

az vm auto-shutdown \
  --resource-group rg-dev \
  --name vm-dev-01 \
  --time 1900 \
  --email "team-dev@contoso.com"
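The savings figure is just a ratio of running hours to hours in a week. A minimal sketch, assuming the VM is deallocated whenever it is off and that only compute charges stop (disks continue to bill); `compute_savings_fraction` is a hypothetical helper.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def compute_savings_fraction(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of a 24/7 compute bill avoided by running only on the given schedule."""
    if not 0 <= hours_on_per_day <= 24 or not 0 <= days_on_per_week <= 7:
        raise ValueError("schedule must fit within a week")
    running_hours = hours_on_per_day * days_on_per_week
    return 1 - running_hours / HOURS_PER_WEEK


weekday_business_hours = compute_savings_fraction(hours_on_per_day=12, days_on_per_week=5)
print(f"{weekday_business_hours:.1%}")
```

Shutdown alone, with the VM restarted every morning including weekends, saves less: `compute_savings_fraction(12, 7)` is 50%.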

For start/stop on a schedule (not just shutdown), drive it from an Automation runbook or a Function on a timer trigger, scoped by tag so it self-discovers new machines:

# Stop every VM tagged AutoStop=true (runbook on a 7 pm schedule)
$vms = Get-AzResource -TagName "AutoStop" -TagValue "true" `
  | Where-Object { $_.ResourceType -eq "Microsoft.Compute/virtualMachines" }
foreach ($vm in $vms) {
  Stop-AzVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name -Force -NoWait
}

Policy-driven SKU limits. Stop the waste before it is created. Restrict which VM SKUs a dev/test subscription may deploy with the built-in “Allowed virtual machine size SKUs” policy, so nobody spins up an M-series box for a build agent:

az policy assignment create \
  --name "limit-dev-skus" \
  --scope "/subscriptions/$DEV_SUB_ID" \
  --policy "cccc23c7-8427-4f53-ad12-b6a63eb452b3" \
  --params '{"listOfAllowedSKUs":{"value":["Standard_B2s","Standard_B2ms","Standard_D2s_v5"]}}'

Enterprise scenario

A platform team running ~900 VMs across 40 subscriptions bought a large 3-year Reservation for the Dsv5 family in their then-dominant region. Six months later a tenancy migration shifted that workload to Easv5 for memory headroom, and reservation utilization fell from 97% to 61%. The amortized export made it visible immediately, but the gotcha was scope: the RIs had been purchased with shared scope, so the leftover discount was being applied automatically to any matching Dsv5 VM org-wide, including dev/test boxes that should never have absorbed a 3-year commitment. The “savings” were real but landing on the wrong cost centers, and showback no longer reconciled.

The fix was not to cancel. Reservations support exchange with no penalty, so they swapped the stranded Dsv5 capacity toward a compute Savings Plan that floats across families and regions, then re-scoped the residual RIs to the single subscription that genuinely ran them 24/7:

az reservations reservation-order calculate-exchange \
  --reservations-to-exchange '[{"reservationId":"'"$RES_ID"'","quantity":40}]' \
  --savings-plans-to-purchase '[{"billingScopeId":"/subscriptions/'"$SUB_ID"'",
      "term":"P3Y","appliedScopeType":"Single","commitment":{"amount":12.5,"currencyCode":"USD","grain":"Hourly"}}]'

The lesson the team codified: default new commitments to single-subscription scope unless you have a deliberate pooling strategy, and treat utilization < 95% as an exchange trigger reviewed monthly — not an end-of-term surprise.

Verify

Confirm the loop is actually closed, not just configured:

# 1. Tag compliance — non-compliant resources for your required-tag policies
az policy state summarize \
  --management-group contoso \
  --query "policyAssignments[?contains(policyAssignmentId,'require-owner-tag')].results.nonCompliantResources"

# 2. Commitment utilization — are you actually filling what you bought?
az consumption reservation summary list --grain monthly \
  --reservation-order-id "$RO_ID" \
  --query "[].{used:avgUtilizationPercentage, min:minUtilizationPercentage}" -o table

# 3. Budgets are armed and pointing at owners
az consumption budget list --query "[].{name:name, amount:amount, alerts:length(notifications)}" -o table

# 4. The export ran and produced data (newest blob by modification time, not by name)
az storage blob list --account-name "$SA" --container-name cost-exports \
  --prefix amortized \
  --query "sort_by(@, &properties.lastModified)[-1].{name:name, modified:properties.lastModified}" -o table

If reservation utilization is below ~95%, you over-bought a family — exchange it toward a Savings Plan or a family you actually run.

FinOps readiness checklist

Reporting cadence and unit economics

Total spend is a vanity metric — it goes up when the business grows, which tells you nothing about efficiency. Report unit economics: cost per order, per tenant, per 1,000 requests, per active user. That is the number that should trend down even as absolute spend rises.
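A unit-economics report is a division, but writing it down makes the point: spend can rise while efficiency improves. The months, amounts, and order counts below are hypothetical, and `unit_cost` is an illustrative helper.

```python
def unit_cost(total_cost: float, units: int) -> float:
    """Cost per business unit (order, tenant, request batch, active user)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical months: (label, amortized spend in USD, orders processed)
months = [("2026-04", 40_000.0, 800_000), ("2026-05", 46_000.0, 1_000_000)]
for label, spend, orders in months:
    print(f"{label}: spend={spend:,.0f} cost_per_order={unit_cost(spend, orders):.4f}")
```

In this example absolute spend grows 15% month over month while cost per order falls, which is the trend the review should be tracking.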

Anchor a monthly review on three KPIs:

Pitfalls

FinOps works when cost is a first-class signal in the same loop as reliability and performance — instrumented, attributed to an owner, governed by policy, and acted on automatically. Get the tag taxonomy and amortized allocation right first; commitments and rightsizing only pay off once you can see, accurately, where every dollar lands.
