Most “cost optimization” efforts die because they are a quarterly spreadsheet, not a system. FinOps only works when the data is trustworthy, the allocation is defensible, and the optimization is automated enough to survive contact with a busy engineering org. This is the operational practice I build on Azure: tagging you can actually enforce, showback finance and engineering both believe, and optimization that runs without anyone remembering to run it.
The FinOps lifecycle: inform, optimize, operate
The FinOps Foundation frames the discipline as three iterating phases, and it is worth anchoring to because each phase has a different failure mode.
- Inform — make spend visible and allocatable. Failure mode: untagged resources and shared costs nobody owns, so every report is contested.
- Optimize — rightsize, shut down idle resources, and commit to discounts. Failure mode: one-off manual cleanups that regress within a quarter.
- Operate — bake cost into daily engineering and the org’s processes. Failure mode: cost stays a finance concern, never a deploy-time one.
You build them in order. Optimization recommendations are noise until allocation is solid, and “operate” gates are resented until the numbers behind them are trusted. The rest of this article is that order, made concrete.
Step 1 — A tagging taxonomy and enforcing it with Azure Policy
Allocation lives and dies on tags. Decide a small mandatory set first; resist the urge to mandate fifteen. My default mandatory tags:
| Tag | Purpose | Example |
|---|---|---|
| costCenter | Maps to the finance GL/cost center | CC-4815 |
| owner | Accountable team alias, not a person | team-checkout |
| environment | Drives policy and showback splits | prod |
| application | The product/service the resource serves | orders-api |
Two structural decisions matter more than the tag list. First, mirror the taxonomy onto subscriptions and resource groups via a management group hierarchy, so allocation has a fallback when individual resources slip through. Second, enforce at write time with Azure Policy, not with after-the-fact reports.
The key trick: combine the built-in modify policy that inherits a tag from the resource group with a deny on resource groups that lack the tag. Humans then only have to tag the resource group, and resources inherit automatically. Use the built-in policy definitions by their stable GUIDs.
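# Built-in: "Inherit a tag from the resource group" (adds the tag, or replaces a differing value on the resource).
# The "... if missing" variant is a separate built-in definition with its own GUID.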
INHERIT_DEF="cd3aa116-8754-49c9-a813-ad46512ece54"
# Built-in: "Require a tag on resource groups"
RG_REQUIRE_DEF="96670d01-0a4d-4649-9c89-2d3abc0a5025"
MG="mg-platform" # management group scope
az policy assignment create \
--name "inherit-costcenter" \
--display-name "Inherit costCenter from RG" \
--policy "$INHERIT_DEF" \
--scope "/providers/Microsoft.Management/managementGroups/$MG" \
--location eastus \
--mi-system-assigned \
--params '{ "tagName": { "value": "costCenter" } }'
az policy assignment create \
--name "require-rg-costcenter" \
--display-name "Require costCenter on resource groups" \
--policy "$RG_REQUIRE_DEF" \
--scope "/providers/Microsoft.Management/managementGroups/$MG" \
--params '{ "tagName": { "value": "costCenter" } }'
The inherit policy uses a modify effect, which means it needs a managed identity and a role assignment to write tags. --mi-system-assigned provisions the identity; you still must grant it Tag Contributor (or Contributor) at the scope. Run az policy remediation create afterward to backfill existing resources; new assignments only act on new writes until you remediate.
Repeat the inherit assignment for owner, environment, and application. For the dimensions that must never be wrong (environment, costCenter), back them with a deny policy on the resources themselves in production management groups, accepting a small amount of developer friction in exchange for clean data. Keep deny out of sandbox subscriptions — there, audit plus a weekly compliance nudge is enough.
Step 2 — Cost allocation, shared costs, and building showback/chargeback
Once tags exist, the genuinely hard part is shared cost — the things no single team provisions but everyone uses: an AKS cluster’s system node pool, a shared Application Gateway, Log Analytics ingestion, NAT Gateway egress, hub firewall. If you ignore these, your showback under-reports by 15-40% and finance stops trusting it.
There are two defensible allocation strategies; pick one per cost type and document it:
- Proportional — split a shared cost by each team’s share of a driver (their compute spend, their namespace’s CPU requests, their request count). Good for genuinely fungible shared infra.
- Even/fixed — split equally or by a negotiated fixed key. Good for “tax” services like security tooling where usage-based splits invite gaming.
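To make the two strategies concrete, here is a minimal Python sketch of the arithmetic. The function names, team aliases, and dollar figures are hypothetical, and this is not how Cost Management computes its splits internally; it only shows what "proportional to a driver" and "even" mean in practice.

```python
def split_proportional(shared_cost, drivers):
    """Split shared_cost across teams in proportion to each team's driver value."""
    total = sum(drivers.values())
    if total <= 0:
        raise ValueError("driver values must sum to a positive number")
    return {team: shared_cost * value / total for team, value in drivers.items()}


def split_even(shared_cost, teams):
    """Split shared_cost equally across the listed teams."""
    if not teams:
        raise ValueError("at least one team is required")
    share = shared_cost / len(teams)
    return {team: share for team in teams}


# Hypothetical month: a $9,000 shared gateway, with each team's own
# compute spend used as the allocation driver.
compute_spend = {"team-checkout": 6000, "team-search": 3000, "team-payments": 1000}
print(split_proportional(9000, compute_spend))
print(split_even(9000, list(compute_spend)))
```

Whichever driver you pick, write it down next to the rule: the number is only defensible if a team can reproduce its own share.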
Azure has a native primitive for this: Cost allocation rules in Cost Management, which let you redistribute costs from a source (subscription, resource group, or tag) to targets proportionally or by a fixed percentage. They are created in the portal under Cost Management > Cost allocation; the resulting splits then flow into Cost Analysis and exports as if the target teams had incurred them. This is the cleanest path for subscription-level shared services.
For intra-cluster allocation (one AKS cluster, many teams by namespace), allocation rules are too coarse — you need usage data Azure does not see. Run OpenCost (the CNCF project; Kubecost is the commercial build) in the cluster to attribute node cost down to namespace and workload based on resource requests and actual usage.
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
helm install opencost opencost/opencost \
--namespace opencost --create-namespace
The distinction between showback and chargeback is organizational, not technical: showback reports each team’s cost for visibility; chargeback actually moves budget. Start with showback for at least one full quarter. Chargeback before the data is trusted turns every month-end into a dispute and poisons the whole program.
Step 3 — Budgets, anomaly detection, and actionable cost alerts
Budgets in Azure are not spending caps — they are alerting thresholds. Create them per cost center or per environment, and wire the action group to something a human owns, not a shared mailbox that everyone mutes.
az consumption budget create \
--budget-name "cc-4815-monthly" \
--amount 25000 \
--category cost \
--time-grain Monthly \
--start-date 2026-06-01 \
--end-date 2027-06-01 \
--resource-group rg-orders-prod
Set thresholds at 80% (actual), 100% (actual), and crucially a forecasted threshold around 100-110%. Forecasted alerts fire early in the month when you can still act; actual alerts at 100% fire when the money is already spent. Budget notifications can target both email recipients and an action group, which is what lets you fan out to Teams, PagerDuty, or a webhook that opens a ticket.
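The difference between actual and forecasted thresholds is easiest to see with numbers. The sketch below uses a naive linear run-rate forecast, which is my simplifying assumption and not the model Azure uses; the function name and figures are illustrative only.

```python
def evaluate_budget(budget, spend_to_date, day_of_month, days_in_month,
                    actual_thresholds=(0.80, 1.00), forecast_threshold=1.10):
    """Return (forecast, fired_alerts) for a monthly budget.

    The forecast is a plain linear extrapolation of month-to-date spend
    (an assumption for illustration; Azure's forecast model differs).
    """
    forecast = spend_to_date / day_of_month * days_in_month
    fired = [f"actual>={t:.0%}" for t in actual_thresholds
             if spend_to_date >= budget * t]
    if forecast >= budget * forecast_threshold:
        fired.append(f"forecast>={forecast_threshold:.0%}")
    return forecast, fired


# Day 10 of a 30-day month: only 48% of the budget is spent,
# but the run-rate already points well past it.
print(evaluate_budget(25000, 12000, day_of_month=10, days_in_month=30))

# Day 28: the 80% actual alert fires, with little time left to react.
print(evaluate_budget(25000, 21000, day_of_month=28, days_in_month=30))
```

In the first case no actual threshold has been crossed, yet the forecast alert fires with twenty days left to act, which is the whole argument for configuring it.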
Thresholds are not anomaly detection, though: a budget can’t catch a service that doubled while staying under budget. Azure Cost Management has a built-in anomaly detection model that flags unusual daily spend patterns per subscription. Surface it programmatically with the Cost Management forecast/anomaly APIs, or subscribe to scheduled anomaly alerts (configured under Cost alerts > Anomaly alerts). The signal you want is “this resource group’s run-rate changed in a way the model didn’t predict,” routed to the owning team, not aggregated to a director who can’t act on it.
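If you want a feel for what a run-rate check looks like on exported daily costs, the following is a small sketch using a median/MAD robust z-score. This is a generic statistical technique I am substituting for illustration; it is not Azure's anomaly model, and the 3.5 cut-off and the sample figures are assumptions.

```python
from statistics import median


def is_anomalous(history, latest, threshold=3.5):
    """Flag `latest` if it deviates strongly from the recent daily costs.

    Uses a robust z-score based on the median absolute deviation (MAD),
    which is less sensitive to one-off spikes in the history than a mean.
    """
    med = median(history)
    mad = median([abs(cost - med) for cost in history])
    if mad == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != med
    robust_z = 0.6745 * (latest - med) / mad
    return abs(robust_z) > threshold


# Hypothetical resource group that normally costs about $100/day.
last_week = [100, 102, 98, 101, 99, 103, 100]
print(is_anomalous(last_week, 205))  # daily cost doubled
print(is_anomalous(last_week, 102))  # ordinary day-to-day noise
```

A resource group at $205/day is still far below a $25,000 monthly budget, so no threshold alert would fire; only a run-rate comparison like this notices the doubling.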
Step 4 — Rightsizing and shutdown automation for idle resources
Recommendations without automation are a backlog that never clears. Split this into advisory (act with judgment) and mechanical (just do it on a schedule).
Advisory — Azure Advisor cost recommendations. Advisor identifies idle and underutilized VMs, idle disks, unused public IPs, and rightsizing candidates based on actual utilization. Pull them as data so they land in a dashboard and a backlog, not a portal blade nobody opens.
az advisor recommendation list \
--category Cost \
--query "[].{resource:impactedValue, problem:shortDescription.problem, savings:extendedProperties.savingsAmount}" \
-o table
Mechanical — auto-shutdown of non-prod compute. Idle dev/test VMs running nights and weekends are pure waste. The native, zero-infrastructure approach is the VM auto-shutdown feature (DevTest Labs surfaces it per-VM, but it exists on standalone VMs too):
# Schedule daily auto-shutdown for a non-prod VM at 19:00 IST.
# --time is interpreted as UTC, so 19:00 IST (UTC+5:30) is 1330.
az vm auto-shutdown \
  --resource-group rg-dev \
  --name vm-build-01 \
  --time 1330 \
  --location eastus
For anything fleet-wide or with start/stop logic, drive it from tags so the schedule is self-service. Tag VMs with shutdown=2200 / startup=0700, then run a scheduled job (Azure Automation runbook or a Function on a timer trigger) that queries Resource Graph for the tag and acts:
# Find VMs opted into scheduled shutdown via tag, across subscriptions
az graph query -q "
Resources
| where type == 'microsoft.compute/virtualmachines'
| where isnotempty(tags['shutdown'])
| project name, resourceGroup, subscriptionId, shutdownAt = tags['shutdown']
"
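The scheduled job then needs one small decision per VM: given its tags and the current time, should it be running? A minimal sketch of that logic follows. It assumes the tag values are HHMM strings in the same timezone as the clock you pass in, and the function names are hypothetical; the actual start/deallocate calls are left out.

```python
from datetime import time


def parse_hhmm(value):
    """Parse a tag value like '2200' into a datetime.time."""
    return time(int(value[:2]), int(value[2:]))


def desired_state(tags, now):
    """Return 'running', 'stopped', or None (leave the VM alone).

    `now` is a datetime.time in the same timezone as the tag values.
    """
    if "shutdown" not in tags:
        return None  # VM has not opted in to scheduling
    stop = parse_hhmm(tags["shutdown"])
    if "startup" not in tags:
        # Shutdown-only: stop after the cut-off, never start automatically.
        return "stopped" if now >= stop else None
    start = parse_hhmm(tags["startup"])
    if start <= stop:
        running = start <= now < stop
    else:
        # Running window wraps past midnight (e.g. startup=2200, shutdown=0600).
        running = now >= start or now < stop
    return "running" if running else "stopped"


office_hours = {"shutdown": "2200", "startup": "0700"}
print(desired_state(office_hours, time(23, 30)))  # stopped
print(desired_state(office_hours, time(9, 0)))    # running
```

Returning None for untagged VMs keeps the job strictly opt-in, so a missing tag never causes a surprise shutdown.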
Auto-shutdown deallocates the VM, so you stop paying for compute but still pay for the OS and data disks. That is the intended trade-off for dev/test. For idle managed disks and orphaned public IPs that Advisor flags, deletion (after a grace period and a snapshot for disks) is where the real recurring savings hide.
Step 5 — Commitment strategy: reservations vs savings plans vs spot
Once usage is steady, commitment discounts are the largest single lever. The three instruments are not interchangeable.
| Instrument | Discount vs PAYG | Flexibility | Best for |
|---|---|---|---|
| Reserved Instances | Highest (up to ~72%) | Locked to a VM family/region (with instance-size flexibility) | Stable, predictable workloads you won’t re-platform |
| Savings Plans | High (slightly below RIs) | Hourly $/hr commit, flexible across VM family, region, and OS | Dynamic compute that shifts shape over time |
| Spot VMs | Up to ~90% | None — can be evicted with 30s notice | Interruptible, stateless, checkpointable batch/CI |
The decision rule I use: cover the stable baseline of your compute with a 1- or 3-year Savings Plan for flexibility, top up with Reservations only where a workload is genuinely pinned to a family (large stateful databases, fixed SKUs), and push interruptible work — CI runners, batch, dev clusters — onto Spot. Aim to cover 60-80% of steady-state on commitments, never 100%; you want headroom so a re-platform doesn’t strand a commitment.
Use the Reservation and Savings Plan recommendation APIs (also surfaced in Cost Management > Reservations) — they model your actual usage and recommend commit amounts and break-even. Treat their “look-back” window carefully: a 7-day window over-commits if last week was a spike. Validate against a 30/60-day view before purchasing.
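The look-back problem is easy to reproduce on paper. The sketch below treats the median hourly spend in a window as the "steady state" and commits a fraction of it. Both choices are my simplifying assumptions, not what the Azure recommendation APIs do, and the spend figures are invented.

```python
from statistics import median


def recommended_hourly_commit(hourly_spend, coverage=0.7):
    """Commit `coverage` (a fraction below 1) of the median hourly spend."""
    if not 0 < coverage < 1:
        raise ValueError("coverage should stay below 100%")
    return coverage * median(hourly_spend)


# Hypothetical 30 days of average hourly compute spend ($/hr):
# 23 ordinary days, then a 7-day launch spike.
history = [10.0] * 23 + [18.0] * 7

seven_day = recommended_hourly_commit(history[-7:])
thirty_day = recommended_hourly_commit(history)
print(f"7-day look-back:  ${seven_day:.2f}/hr")
print(f"30-day look-back: ${thirty_day:.2f}/hr")
```

The 7-day window sees only the spike and suggests a commitment well above what the ordinary days would ever consume, while the 30-day window stays anchored to the baseline.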
Spot is a scheduling decision. On AKS, a dedicated spot node pool with the right taint keeps spot-tolerant work isolated:
az aks nodepool add \
--resource-group rg-aks \
--cluster-name aks-prod \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--min-count 0 --max-count 20 \
--node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
--spot-max-price -1 means “pay up to the on-demand price,” which maximizes availability while still capturing the spot discount.
Step 6 — Exporting cost data and building unit-economics dashboards
The portal is for browsing; a real practice needs cost data as a queryable artifact. Configure a scheduled Cost Management export to a storage account in FOCUS format — the FinOps Foundation’s open cost-and-usage spec — so your pipeline isn’t coupled to Azure’s proprietary column layout and can ingest other clouds the same way.
# Note: this CLI command creates an ActualCost export. The FOCUS dataset is
# selected when you create the export in the portal or through the Exports
# REST API; none of the flags below choose it.
az costmanagement export create \
  --name "daily-focus-export" \
  --scope "/subscriptions/<sub-id>" \
  --storage-account-id "/subscriptions/<sub-id>/resourceGroups/rg-finops/providers/Microsoft.Storage/storageAccounts/stfinopsexports" \
  --storage-container "cost-exports" \
  --storage-directory "focus" \
  --recurrence Daily \
  --recurrence-period from="2026-06-01T00:00:00Z" to="2027-06-01T00:00:00Z" \
  --timeframe MonthToDate \
  --type ActualCost
From there, ingest into your warehouse (Microsoft Fabric, Synapse, Databricks, or BigQuery if you’re multi-cloud) and join cost to a business metric. The number leadership actually cares about is unit economics: cost per tenant, per order, per active user, per GB processed. A flat “$48k on compute” tells no one whether you’re efficient; “$0.021 per order, down from $0.027” drives decisions.
The join key is the same tagging taxonomy from Step 1. application and environment let you slice cost to a service; a metrics table keyed by the same application gives you the denominator. Build the unit-cost trend, not just the absolute, and put it where engineers see it.
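As a rough illustration of that join, here is a Python sketch that turns cost rows and a metrics table into a cost-per-order trend. The row shapes and field names (month, application, environment, cost, orders) are hypothetical stand-ins for whatever your warehouse tables look like; they are not FOCUS column names.

```python
from collections import defaultdict


def unit_cost_by_month(cost_rows, metric_rows, environment="prod"):
    """Return {(month, application): cost per order} for one environment."""
    cost = defaultdict(float)
    orders = defaultdict(float)
    for row in cost_rows:
        if row["environment"] == environment:
            cost[(row["month"], row["application"])] += row["cost"]
    for row in metric_rows:
        orders[(row["month"], row["application"])] += row["orders"]
    # Skip keys with no orders to avoid dividing by zero.
    return {key: cost[key] / orders[key] for key in cost if orders.get(key)}


cost_rows = [
    {"month": "2026-05", "application": "orders-api", "environment": "prod", "cost": 27000.0},
    {"month": "2026-06", "application": "orders-api", "environment": "prod", "cost": 25200.0},
    {"month": "2026-06", "application": "orders-api", "environment": "dev", "cost": 3000.0},
]
metric_rows = [
    {"month": "2026-05", "application": "orders-api", "orders": 1_000_000},
    {"month": "2026-06", "application": "orders-api", "orders": 1_200_000},
]

for (month, app), value in sorted(unit_cost_by_month(cost_rows, metric_rows).items()):
    print(f"{month} {app}: ${value:.3f} per order")
```

Absolute spend barely moved between the two months in this made-up data, yet cost per order fell from $0.027 to $0.021, which is the story the trend line should tell.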
Step 7 — Embedding cost gates into IaC and CI/CD
This is the “operate” phase, and it is what makes the practice durable: cost becomes a property reviewed at deploy time, like a failing test. The tool I reach for is Infracost, which estimates the monthly cost delta of a Terraform change in the pull request.
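# .github/workflows/cost-gate.yml
name: Cost Gate
on: [pull_request]
jobs:
  infracost:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required to post the PR comment
    steps:
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      # Baseline: estimate the target branch first
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.base.ref }}
      - name: Generate cost estimate baseline
        run: |
          infracost breakdown --path=. \
            --format=json --out-file=/tmp/infracost-base.json
      # Then estimate the PR branch and diff it against the baseline
      - uses: actions/checkout@v4
      - name: Generate cost diff
        run: |
          infracost diff --path=. \
            --compare-to=/tmp/infracost-base.json \
            --format=json --out-file=/tmp/infracost.json
      - name: Post cost diff to PR
        run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ secrets.GITHUB_TOKEN }} \
            --behavior=update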
Pair the estimate with policy-as-code so the gate has teeth. Infracost integrates with OPA/Conftest, or you can run Conftest directly against the Terraform plan to fail PRs that, say, provision an untagged resource or a VM SKU above an approved tier:
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
conftest test plan.json --policy ./policy
Keep the IaC gate aligned with the Azure Policy from Step 1 — the same costCenter/owner requirement, caught earlier and cheaper in the PR instead of denied at terraform apply. Two layers, same rule, shift-left.
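Conftest policies themselves are written in Rego, but the rule is simple enough to sketch in Python to show what the gate inspects in plan.json. The function name and the sample plan are hypothetical; the resource_changes, change.actions, and change.after fields are part of Terraform's JSON plan representation. Treating a missing tags attribute as "not taggable" is an assumption, and tags that are only known at apply time would need extra handling.

```python
REQUIRED_TAGS = ("costCenter", "owner", "environment", "application")


def missing_tag_violations(plan):
    """Return (address, missing_tags) for created/updated resources lacking mandatory tags."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # deletes and no-ops cannot introduce untagged spend
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # assumption: no tags attribute means the type is not taggable
        tags = after["tags"] or {}
        missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
        if missing:
            violations.append((rc["address"], missing))
    return violations


# Hypothetical fragment of `terraform show -json tfplan` output.
plan = {
    "resource_changes": [
        {"address": "azurerm_storage_account.logs",
         "change": {"actions": ["create"],
                    "after": {"tags": {"costCenter": "CC-4815", "owner": "team-checkout"}}}},
        {"address": "azurerm_resource_group.old",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
for address, missing in missing_tag_violations(plan):
    print(f"{address}: missing {', '.join(missing)}")
```

In a real pipeline you would load plan.json with json.load and exit non-zero when the list is non-empty, so the PR check fails the same way the Conftest run would.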
Enterprise scenario
A retail platform team I worked with ran the FOCUS export from Step 6 into Synapse and built clean per-application showback. It held for everything except their largest line item: AKS. One shared production cluster ran twelve teams across namespaces, and the daily cost export attributed the entire node-pool spend to the single subscription that owned the cluster — roughly 22% of the monthly bill landing in one “platform” bucket nobody would claim. Finance refused to sign off on chargeback while a fifth of spend was unallocated.
The gotcha: Azure cost allocation rules redistribute at the subscription/RG/tag grain, but they cannot see inside a cluster. Node cost is real-time bin-packing of pods Azure has no visibility into, so no native primitive could split it by namespace.
The fix was OpenCost for the usage signal, exported as Prometheus metrics, joined to the FOCUS data on application in the warehouse. The critical detail everyone misses: split node cost by CPU/memory requests, not actual usage, otherwise a team that under-requests freeloads on a team that reserves headroom. OpenCost exposes both, so be explicit:
# Per-namespace allocation over the last 7 days, with idle capacity as its own line.
# The response's data field is a list of {name: allocation} maps, hence the double [].
curl -G "http://opencost.opencost.svc:9003/allocation/compute" \
  --data-urlencode "window=7d" \
  --data-urlencode "aggregate=namespace" \
  --data-urlencode "includeIdle=true" \
  | jq '.data[][] | {namespace: .name, cpuCost, ramCost, totalCost}'
Surfacing idle separately (includeIdle=true) mattered too: it exposed that unallocatable idle capacity was 18% of cluster cost, which became a rightsizing target instead of a hidden tax. Within two months unallocated spend dropped below 3% and chargeback went live.
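The arithmetic behind a requests-weighted split with a separate idle line can be sketched in a few lines of Python. This is a deliberately simplified, CPU-only illustration with invented numbers; OpenCost's real model also prices memory and other resources, and the function name here is hypothetical.

```python
def allocate_by_requests(cluster_cost, capacity_cores, requests):
    """Charge each namespace for the cores it requested; report the rest as idle."""
    requested = sum(requests.values())
    if requested > capacity_cores:
        raise ValueError("requests exceed cluster capacity")
    cost_per_core = cluster_cost / capacity_cores
    allocation = {ns: cores * cost_per_core for ns, cores in requests.items()}
    # Capacity nobody requested is kept visible rather than smeared across teams.
    allocation["__idle__"] = cluster_cost - sum(allocation.values())
    return allocation


# Hypothetical cluster: $10,000/month, 100 cores, 82 of them requested.
result = allocate_by_requests(10_000, 100, {"checkout": 50, "search": 32})
for namespace, cost in result.items():
    print(f"{namespace}: ${cost:,.0f} ({cost / 10_000:.0%})")
```

Keeping the idle share as its own row is what turns it into a target someone can own, instead of quietly inflating every team's bill.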
Verify
Confirm each layer is actually working before you declare victory:
# 1. Tag enforcement: list non-compliant resources for the inherit policy
az policy state list \
--filter "PolicyAssignmentName eq 'inherit-costcenter' and ComplianceState eq 'NonCompliant'" \
--query "[].{resource:resourceId, policy:policyDefinitionName}" -o table
# 2. Allocation coverage: the count of resources missing costCenter should trend to near-zero
az graph query -q "
Resources
| where isnull(tags['costCenter']) or tags['costCenter'] == ''
| summarize untaggedResources = count() by type
| order by untaggedResources desc
"
# 3. Budgets exist where they should
az consumption budget list --query "[].{name:name, amount:amount, grain:timeGrain}" -o table
# 4. Export is running and dropping files
az storage blob list \
--account-name stfinopsexports \
--container-name cost-exports \
--prefix "focus/" --query "[].name" -o tsv | tail -5
For the CI gate, open a throwaway PR that adds an untagged resource and confirm the pipeline fails and the Infracost comment posts.
Pitfalls
- Tagging by report instead of by policy. If tags are not enforced at write time, allocation coverage decays the moment you stop nagging. Modify-effect inheritance plus a deny on the highest-value dimensions is the only thing that holds.
- Chargeback before trust. Moving real budget on contested numbers turns FinOps into a finance-vs-engineering fight. Earn a quarter of believed showback first.
- Over-committing on a spike. Reservation/Savings Plan recommendations using a 7-day look-back over-buy after a busy week. Validate against 30-60 days and cap coverage below 100%.
- Alert fatigue. Budgets and anomalies routed to a shared mailbox get muted within a month. Route to the owning team’s real channel, and make sure the alert says what changed and what to do, not just that it changed.
- Orphaned resources hiding the savings. The biggest mechanical wins are usually idle disks, unattached IPs, and stopped-but-not-deallocated VMs — not heroic rightsizing. Automate the boring cleanup first.
A FinOps practice is not a dashboard you build once. It is a loop: inform with trustworthy tagged data, optimize with automation that survives without you, and operate by pushing cost left into the PR. Get the tagging and allocation right, and every later step becomes a small, defensible increment instead of a quarterly fight.