Most “we have Azure Monitor” stories fall apart under two questions: what exactly are you collecting, and what is it costing you per GB per month? The answer is usually a shrug, a legacy MMA agent nobody dares remove, and a Log Analytics bill that grew 40% last quarter with no new workloads. The modern stack fixes this by making collection an explicit, versioned artifact – a Data Collection Rule (DCR) – and by letting you drop or reshape data before you pay to ingest it. This piece builds the whole chain as code: DCRs and endpoints feeding the Azure Monitor Agent, ingestion-time transformations that trim cost, a workspace and table design that matches your retention economics, workbooks that turn KQL into something an on-call engineer can actually read, metric and log alerts that scale across resources, and action groups that hand off to automation instead of paging a human at 3am.
Mental model. Azure Monitor has a collection plane and a signal plane. The collection plane (DCRs, DCEs, the Azure Monitor Agent, the Logs Ingestion API) decides what lands in a table and in what shape. The signal plane (metric alerts, scheduled-query alerts, action groups, alert processing rules) decides what humans and automation hear about. Teams that conflate the two end up alerting on raw firehose data they should have filtered at ingestion. Filter low, alert high.
1. Data Collection Rules, endpoints, and the Azure Monitor Agent
The legacy Log Analytics agent (MMA/OMS) is retired as of 31 August 2024. The replacement is the Azure Monitor Agent (AMA), and AMA does nothing on its own – it is driven entirely by the Data Collection Rules associated with a machine. A DCR is a first-class ARM resource that declares three things: dataSources (what to read – perf counters, syslog facilities, Windows event logs), destinations (where to send – one or more Log Analytics workspaces), and dataFlows (which source maps to which destination table, plus an optional transformation).
A Data Collection Endpoint (DCE) is the ingestion entry point. You need an explicit DCE when you use the Logs Ingestion API (custom logs pushed over REST) or when you require Private Link for ingestion via an Azure Monitor Private Link Scope (AMPLS). For plain AMA collection over public networking, a DCE is optional, but standardising on one keeps Private Link a config change rather than a re-architecture.
Register providers and create the endpoint:
az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.OperationalInsights
# Data Collection Endpoint -- the ingestion entry point
az monitor data-collection endpoint create \
--name dce-platform-eastus \
--resource-group rg-observability \
--location eastus \
--public-network-access Enabled
Now the DCR. This one collects a focused set of Linux perf counters and syslog, sending them to a workspace. Note the stream names: built-in tables use Microsoft-Perf, Microsoft-Syslog, Microsoft-Event, etc.
{
  "location": "eastus",
  "properties": {
    "dataCollectionEndpointId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionEndpoints/dce-platform-eastus",
    "dataSources": {
      "performanceCounters": [
        {
          "name": "perf-core",
          "streams": ["Microsoft-Perf"],
          "samplingFrequencyInSeconds": 60,
          "counterSpecifiers": [
            "\\Processor(_Total)\\% Processor Time",
            "\\Memory\\Available MBytes",
            "\\LogicalDisk(_Total)\\% Free Space"
          ]
        }
      ],
      "syslog": [
        {
          "name": "syslog-warn",
          "streams": ["Microsoft-Syslog"],
          "facilityNames": ["auth", "daemon", "syslog"],
          "logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"]
        }
      ]
    },
    "destinations": {
      "logAnalytics": [
        {
          "name": "la-platform",
          "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform"
        }
      ]
    },
    "dataFlows": [
      { "streams": ["Microsoft-Perf"], "destinations": ["la-platform"] },
      { "streams": ["Microsoft-Syslog"], "destinations": ["la-platform"] }
    ]
  }
}
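Because a data flow that names an undeclared stream or destination is an easy mistake to make, it can help to lint the rule file locally before creating it. The sketch below is a hypothetical helper (lintDcr is a name invented here, not part of any Azure SDK). It checks only that the three sections agree with each other, assumes agent-collected streams (custom streams declared under streamDeclarations would need an extra case), and does not validate against the real ARM schema.

```javascript
// Hypothetical local lint for a DCR document; not an Azure SDK function.
// It checks only that every dataFlow refers to streams and destinations
// that the same rule declares.
function lintDcr(dcr) {
  const props = dcr.properties ?? {};
  const problems = [];

  // Collect every stream declared by any data source (perf, syslog, ...).
  const declaredStreams = new Set();
  for (const sources of Object.values(props.dataSources ?? {})) {
    for (const source of sources) {
      for (const stream of source.streams ?? []) declaredStreams.add(stream);
    }
  }

  // Collect every destination name (logAnalytics or any other kind).
  const declaredDestinations = new Set();
  for (const group of Object.values(props.destinations ?? {})) {
    for (const dest of Array.isArray(group) ? group : [group]) {
      if (dest.name) declaredDestinations.add(dest.name);
    }
  }

  // Every stream and destination used by a flow must have been declared.
  for (const flow of props.dataFlows ?? []) {
    for (const stream of flow.streams ?? []) {
      if (!declaredStreams.has(stream)) problems.push(`undeclared stream: ${stream}`);
    }
    for (const dest of flow.destinations ?? []) {
      if (!declaredDestinations.has(dest)) problems.push(`undeclared destination: ${dest}`);
    }
  }
  return problems;
}
```

An empty array means the sections are internally consistent; anything else is worth fixing before the rule reaches a pull request.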
Create it and associate machines. Association is what actually arms the agent:
az monitor data-collection rule create \
--name dcr-linux-platform \
--resource-group rg-observability \
--location eastus \
--rule-file ./dcr-linux-platform.json
# Bind the DCR to a VM (repeat per machine, or drive via Policy at scale)
az monitor data-collection rule association create \
--name dcra-vm-app-01 \
--rule-id "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionRules/dcr-linux-platform" \
--resource "/subscriptions/<sub>/resourceGroups/rg-fleet/providers/Microsoft.Compute/virtualMachines/vm-app-01"
At fleet scale you never run that association by hand. Use the built-in Azure Policy initiative that installs AMA and creates the association from a DCR parameter, assigned at a management-group scope with a DeployIfNotExists effect and a managed identity for remediation. One machine or ten thousand, the same DCR is the unit of intent.
2. Ingestion-time transformations and KQL filtering for cost control
This is the highest-leverage feature in the whole platform and the one most teams have never enabled. A transformation is a KQL snippet attached to a dataFlow that runs at ingestion time, before data is billed and stored. You can drop rows, drop columns, redact PII, and project new computed fields. Because billing is on ingested volume, a transformation that filters 60% of chatty Information-level syslog is a direct, recurring line-item reduction. One caveat: Azure Monitor may apply a data-processing charge when a transformation drops more than half of the incoming data, so check the current pricing page before banking the full saving.
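To put a number on that reduction, the arithmetic is simple enough to sketch. The function below is illustrative only: estimateMonthlySaving and its parameter names are invented here, the price per GB is an input you supply rather than a quoted Azure rate, and it ignores any data-processing charge that may apply to heavily filtered streams.

```javascript
// Rough monthly ingestion saving from dropping a fraction of a stream.
// pricePerGb is an input, not a quoted Azure price; look up your own rate.
function estimateMonthlySaving({ dailyGb, dropFraction, pricePerGb, daysPerMonth = 30 }) {
  if (dropFraction < 0 || dropFraction > 1) {
    throw new RangeError("dropFraction must be between 0 and 1");
  }
  const droppedGb = dailyGb * dropFraction * daysPerMonth;
  return { droppedGb, saving: droppedGb * pricePerGb };
}

// Made-up inputs: 50 GB/day of syslog, 60% filtered, 2.00 per GB.
console.log(estimateMonthlySaving({ dailyGb: 50, dropFraction: 0.6, pricePerGb: 2 }));
```

With those made-up inputs the filter removes about 900 GB a month. Because the saving recurs every month, it is usually worth running against real figures from the Usage table.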
The transform operates on a virtual input table named source and must project columns that match the destination table’s schema. Add a transformKql to the relevant data flow:
"dataFlows": [
  {
    "streams": ["Microsoft-Syslog"],
    "destinations": ["la-platform"],
    "transformKql": "source | where SeverityLevel != 'info' | where ProcessName !in ('CRON','sudo') | project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage"
  }
]
A few rules that bite people:
- The transform output schema must match the target table. If you project-away a column the table requires, ingestion silently drops or nulls it – validate against the table schema, not your assumptions.
- TimeGenerated must survive the transform. If you drop it, every row gets stamped at ingestion time and your time-series goes flat.
- Transformations apply to a specific stream into a specific destination. To redact across many sources you attach a transform to each data flow; there is no single global filter.
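One way to check those rules before deploying is to dry-run the filter against a sample export. The sketch below mirrors the syslog transformKql shown earlier by hand; it does not execute KQL. dryRunSyslogTransform and the constants are names invented for this example, and the sample rows are assumed to be plain objects keyed by Syslog column name.

```javascript
// Local approximation of the syslog transform, for estimating its effect
// on a sample export before deploying it. This mirrors the filter by hand;
// it does not execute KQL.
const KEPT_COLUMNS = ["TimeGenerated", "Computer", "Facility", "SeverityLevel", "SyslogMessage"];
const NOISY_PROCESSES = new Set(["CRON", "sudo"]);

function dryRunSyslogTransform(rows) {
  const kept = rows
    .filter((r) => r.SeverityLevel !== "info")           // where SeverityLevel != 'info'
    .filter((r) => !NOISY_PROCESSES.has(r.ProcessName))  // where ProcessName !in (...)
    .map((r) => Object.fromEntries(KEPT_COLUMNS.map((c) => [c, r[c]]))); // project ...

  // Guard for the TimeGenerated rule: every surviving row must still carry it.
  const missingTime = kept.filter((r) => r.TimeGenerated == null).length;
  return {
    kept,
    dropRatio: rows.length === 0 ? 0 : 1 - kept.length / rows.length,
    missingTime,
  };
}
```

A dropRatio far from what you expected, or a non-zero missingTime, is a cue to revisit the transform before it reaches production.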
For custom logs over the Logs Ingestion API, the transform is even more powerful because you control the input shape. A common pattern is to send fat JSON and let the transform split it into normal columns plus a dynamic JSON column, or to compute a severity from a free-text message:
source
| extend Severity = case(
Message has_cs "ERROR", "Error",
Message has_cs "WARN", "Warning",
"Information")
| where Severity != "Information"
| project TimeGenerated = todatetime(EventTime), Computer, Severity, Message
Cost rule of thumb. Filter at ingestion for volume you will never query (debug chatter, health-probe 200s). Use a cheaper table plan (next section) for volume you query rarely but must retain. Never solve a cost problem by turning off collection you will wish you had during an incident.
3. Log Analytics workspace design, tables, and table-level plans
Two workspace decisions dominate the bill: how many workspaces you run, and the table plan on each table. The modern guidance is few workspaces, many tables, per-table plans – one regional platform workspace per major boundary rather than a workspace per team, because cross-workspace KQL (workspace()/union) is awkward and access control is now solvable at the table and row level.
Azure Monitor Logs offers three table plans:
| Plan | Use for | Query | Retention model |
|---|---|---|---|
| Analytics | Hot, frequently queried signals (alerts, dashboards) | Full KQL, fast | Configurable interactive retention, optional long-term retention |
| Basic | High-volume, occasionally queried logs (verbose app/network logs) | KQL subset, per-query billed | Short interactive + long-term archive |
| Auxiliary | Very high-volume, low-fidelity (raw audit, verbose firewall) | Limited KQL, lowest ingest cost | Long-term, cheapest ingest |
Set retention with two dials: interactive retention (queryable without restore) and total retention (interactive + cheap long-term archive). Alert rules and dashboards must read from interactive retention; archived data needs a search job or restore first.
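The two dials are easier to reason about as a small classification. The sketch below is illustrative: retentionTier and alertWindowFits are names invented here, and the tier labels are descriptive strings rather than Azure API values. The retentionInDays and totalRetentionInDays field names follow the table properties this article checks in its verification steps.

```javascript
// Classify how a record of a given age can be read, given the two dials.
// Tier names are descriptive labels chosen here, not Azure API values.
function retentionTier(ageDays, { retentionInDays, totalRetentionInDays }) {
  if (totalRetentionInDays < retentionInDays) {
    throw new RangeError("total retention cannot be shorter than interactive retention");
  }
  if (ageDays <= retentionInDays) return "interactive";    // alerts and dashboards can read it
  if (ageDays <= totalRetentionInDays) return "long-term"; // needs a search job or restore
  return "expired";                                        // no longer retained
}

// An alert's lookback must sit entirely inside interactive retention.
function alertWindowFits(lookbackDays, table) {
  return retentionTier(lookbackDays, table) === "interactive";
}
```

For a table with 30 days interactive and 365 days total retention, a 7-day lookback is fine for an alert, while a 90-day investigation has to go through a search job or restore.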
# Create the workspace
az monitor log-analytics workspace create \
--resource-group rg-observability \
--workspace-name law-platform \
--location eastus \
--retention-time 90
# Move a chatty custom table to the Basic plan and set retention split
az monitor log-analytics workspace table update \
--resource-group rg-observability \
--workspace-name law-platform \
--name AppVerbose_CL \
--plan Basic \
--retention-time 30 \
--total-retention-time 365
Pair this with table-level RBAC so an app team sees its own *_CL tables but not the platform security tables, instead of minting a workspace per team just to scope access.
4. Workbooks: parameters, queries, and reusable visual templates
A workbook is a JSON template (an ARM resource of type Microsoft.Insights/workbooks) that combines parameters, KQL query steps, text, and visualisations. The feature that makes them reusable – rather than a screenshot with extra steps – is parameters: a parameter is itself usually a KQL query, and downstream steps interpolate it with {ParamName}.
The pattern that scales: a top-of-workbook time-range parameter plus a resource/subscription picker, then every query references both. Here is the parameter-and-query skeleton inside the workbook items array:
{
  "type": 9,
  "content": {
    "parameters": [
      {
        "name": "TimeRange",
        "type": 4,
        "isRequired": true,
        "value": { "durationMs": 3600000 }
      },
      {
        "name": "Subscription",
        "type": 6,
        "query": "summarize by subscriptionId",
        "queryType": 1,
        "crossComponentResources": ["value::all"]
      }
    ]
  }
}
A query step that consumes them. Note that {TimeRange} expands into the comparison half of the filter (for example > ago(1h)), which is why the query writes it directly after TimeGenerated; the time-brush feeds the chart automatically:
Perf
| where TimeGenerated {TimeRange}
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart
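If the interpolation feels opaque, a toy model of it may help. The sketch below is an illustration only: expandParameters and durationToKql are names invented here, the workbook engine performs this substitution for you, and the exact text it emits for a time range may differ from what this code produces.

```javascript
// Illustration only: a toy version of workbook parameter substitution.
// The real workbook engine does this for you; the exact text it emits
// for a time range may differ from this sketch.
function durationToKql(durationMs) {
  const units = [["d", 86400000], ["h", 3600000], ["m", 60000], ["s", 1000]];
  for (const [suffix, ms] of units) {
    // Use the largest unit that divides the duration evenly.
    if (durationMs >= ms && durationMs % ms === 0) return `> ago(${durationMs / ms}${suffix})`;
  }
  return `> ago(${durationMs}ms)`;
}

function expandParameters(query, params) {
  // Replace each {Name} token with the parameter's rendered value;
  // unknown tokens are left untouched so mistakes stay visible.
  return query.replace(/\{(\w+)\}/g, (token, name) => {
    if (!(name in params)) return token;
    const value = params[name];
    const isTimeRange = value !== null && typeof value === "object" && "durationMs" in value;
    return isTimeRange ? durationToKql(value.durationMs) : String(value);
  });
}

console.log(
  expandParameters("Perf | where TimeGenerated {TimeRange}", { TimeRange: { durationMs: 3600000 } })
);
```

The point of the model is that a parameter is plain text substitution into the query, so a step that never mentions the token can never react to the picker.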
Two practices keep workbooks maintainable. First, pin parameter queryType and crossComponentResources so the same template works whether it is scoped to one resource or an entire subscription. Second, keep the workbook JSON in source control and deploy it as a shared workbook via Bicep so every team gets the same “service health” workbook rather than forking ten copies:
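// Parameters the resource below depends on
param location string = resourceGroup().location
param workspaceResourceId string

resource wb 'Microsoft.Insights/workbooks@2023-06-01' = {
  // Deterministic GUID so redeployments update the same workbook
  name: guid('platform-health-workbook')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Platform Health'
    category: 'workbook'
    sourceId: workspaceResourceId
    // Workbook definition kept as a JSON file next to the template
    serializedData: loadTextContent('./workbooks/platform-health.json')
  }
}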
5. Metric alerts, dynamic thresholds, and multi-resource scoping
Metric alerts evaluate platform metrics (or custom metrics) on a near-real-time, pre-aggregated stream – they are cheap, fast, and stateful. Two capabilities make them scale. Multi-resource scope lets one alert rule watch every VM in a resource group or subscription of the same type, so you author one rule instead of one-per-VM. Dynamic thresholds replace a hand-picked number with a machine-learned band over the metric’s history, which is the only sane choice for metrics whose “normal” varies by time of day.
A static, multi-resource CPU alert over an entire resource group:
az monitor metrics alert create \
--name "vm-cpu-high" \
--resource-group rg-observability \
--scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
--target-resource-type "Microsoft.Compute/virtualMachines" \
--target-resource-region eastus \
--condition "avg Percentage CPU > 85" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
For dynamic thresholds the condition uses the dynamic operator with a sensitivity and a violation count (4 violations out of 4 periods is far less noisy than 1 of 1):
az monitor metrics alert create \
--name "vm-cpu-dynamic" \
--resource-group rg-observability \
--scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
--target-resource-type "Microsoft.Compute/virtualMachines" \
--target-resource-region eastus \
--condition "avg Percentage CPU >< dynamic medium 4 of 4" \
--window-size 5m \
--evaluation-frequency 5m \
--severity 2 \
--action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
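The "4 of 4" versus "1 of 1" trade-off can be made concrete with a small model. The sketch below covers only the violation counting, not the machine-learned band itself: the per-period breach flags are inputs, and shouldFire is a name invented for this example.

```javascript
// Decide whether an alert should fire given per-period breach flags,
// newest last. "minFailing of periods" mirrors the "4 of 4" wording above.
function shouldFire(breaches, minFailing, periods) {
  if (minFailing > periods) throw new RangeError("minFailing cannot exceed periods");
  const recent = breaches.slice(-periods);
  if (recent.length < periods) return false; // not enough history yet
  return recent.filter(Boolean).length >= minFailing;
}

// A single spike versus a sustained breach.
const spike = [false, false, false, true];
const sustained = [true, true, true, true];
console.log(shouldFire(spike, 1, 1), shouldFire(spike, 4, 4), shouldFire(sustained, 4, 4));
```

A one-period spike trips the 1-of-1 rule but not the 4-of-4 rule, which is the noise reduction the stricter setting buys at the cost of slower detection.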
Auto-mitigation matters. Metric alerts are stateful: a fired alert auto-resolves when the condition clears (default behaviour), and the action group is notified of resolved as well as fired. Do not build alert logic that assumes you must manually close alerts – wire your downstream automation to handle the resolved signal too.
6. Scheduled query (log) alerts and stateful alert processing
When the signal lives in logs rather than a metric – “more than 20 5xx responses from one pod in 5 minutes,” “a privileged role was assigned” – you need a scheduled query rule (Microsoft.Insights/scheduledQueryRules, API version 2021-08-01 and later, sometimes called Log Alerts v2). It runs KQL on a schedule, compares an aggregated result to a threshold, and fires.
The two settings that separate a good log alert from an alert storm are stateful alerts (autoMitigate) and dimensions. Dimensions split one rule into one alert per value of a grouping column – so a rule grouped by Computer fires a separate, independently-resolving alert per machine, instead of one giant alert that flaps.
# The dimension split is written inside --condition ("where <column> includes <values>");
# the CLI has no separate dimension flag.
az monitor scheduled-query create \
--name "syslog-error-burst" \
--resource-group rg-observability \
--scopes "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform" \
--condition "count 'errs' > 20 where Computer includes *" \
--condition-query errs='Syslog | where SeverityLevel in ("err","crit","alert","emerg")' \
--window-size 5m \
--evaluation-frequency 5m \
--severity 2 \
--auto-mitigate true \
--action-groups "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
Three principal-level rules for log alerts:
- Aggregate inside the query, not in your head. The rule compares a single aggregated number per dimension to the threshold. summarize count() by Computer, bin(TimeGenerated, 5m) keeps the evaluation deterministic.
- Keep evaluation-frequency >= the data latency. Log ingestion has minutes of latency; evaluating every 1 minute against data that arrives every 3 produces false negatives and duplicate fires. Match frequency to reality.
- Read from interactive retention only. Alert queries cannot reach archived (long-term) data without a restore. If a table is on the Basic/Auxiliary plan with short interactive retention, your alert window must fit inside it.
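To see why dimensions turn one rule into independent alerts, it helps to emulate the evaluation by hand. The sketch below is a simplified model, not the alerting engine: evaluateByDimension is a name invented here, and it assumes the rows are the query results for a single evaluation window.

```javascript
// Emulate how a dimension splits one rule into per-value evaluations:
// count rows per dimension value, then compare each count to the threshold.
function evaluateByDimension(rows, dimension, threshold) {
  const counts = new Map();
  for (const row of rows) {
    const key = row[dimension];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // One result per dimension value, each with its own fired/healthy state.
  return [...counts].map(([value, count]) => ({
    [dimension]: value,
    count,
    fired: count > threshold,
  }));
}
```

Each entry in the result corresponds to one alert instance, so a noisy machine fires and resolves without dragging the healthy ones along with it.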
7. Action groups, alert processing rules, and suppression
An action group is the reusable fan-out target: a named bundle of notifications (email, SMS, push, voice) and actions (webhook, Logic App, Function, Automation Runbook, ITSM connector). Every alert type – metric, log, activity log, Service Health – points at the same action group resource, so you manage on-call routing in one place.
az monitor action-group create \
--name ag-platform-oncall \
--resource-group rg-observability \
--short-name pltoncall \
--action email oncall-lead <oncall-email> \
--action webhook pagerduty https://events.pagerduty.com/integration/<key>/enqueue \
--action logicapp incident-workflow \
"/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident" \
"/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident/triggers/manual/paths/invoke"
The piece teams miss is the alert processing rule (Microsoft.AlertsManagement/actionRules). It sits between alerts and action groups and does two jobs without touching a single alert rule:
- Suppression – mute notifications across a scope on a schedule (a maintenance window) so 400 VMs being patched do not page anyone.
- Add action groups – bolt an action group onto every alert in a scope (e.g., add the SecOps action group to all Sev0/Sev1 alerts in production) centrally.
Maintenance-window suppression across a resource group:
az monitor alert-processing-rule create \
--name "suppress-maint-window" \
--resource-group rg-observability \
--scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
--rule-type RemoveAllActionGroups \
--filter-severity Equals Sev2 Sev3 \
--schedule-recurrence-type Weekly \
--schedule-recurrence-start-time "02:00:00" \
--schedule-recurrence-end-time "04:00:00" \
--schedule-recurrence Sunday \
--description "Mute Sev2/Sev3 during Sunday patch window"
This is how you keep a noisy estate humane: the alert rules stay armed and the processing layer decides who hears them and when.
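The suppression decision itself is a small predicate, and modelling it can help when reasoning about edge cases such as alerts that fire at the window boundary. The sketch below is a toy model evaluated in UTC: isSuppressed and the rule's field names (severities, day, startHour, endHour) are invented for this example and are not the alert processing rule schema.

```javascript
// Toy model of a weekly suppression rule evaluated in UTC.
// Field names (severities, day, startHour, endHour) are made up for this
// sketch and are not the alert processing rule schema.
const DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"];

function isSuppressed(alert, rule, now) {
  const inSeverity = rule.severities.includes(alert.severity);
  const inDay = DAYS[now.getUTCDay()] === rule.day;
  const hour = now.getUTCHours() + now.getUTCMinutes() / 60;
  const inWindow = hour >= rule.startHour && hour < rule.endHour; // end is exclusive
  return inSeverity && inDay && inWindow;
}

// Mirrors the Sunday 02:00-04:00 Sev2/Sev3 window from the CLI example.
const patchWindow = { severities: ["Sev2", "Sev3"], day: "Sunday", startHour: 2, endHour: 4 };
```

A Sev1 alert falls outside the severity filter and still notifies during the window, which is usually the behaviour you want from a maintenance mute.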
8. Automation hooks to Logic Apps, Functions, and webhooks
The point of all of the above is to do something without a human. An action group can call a Logic App, an Azure Function, or a raw webhook, passing the alert as JSON. Use the common alert schema so every downstream gets the same envelope regardless of whether a metric or log alert fired – otherwise your Function has to parse three different payload shapes.
A Function that auto-remediates by parsing the common schema and restarting a service (sketch, Node.js):
module.exports = async function (context, req) {
  const alert = req.body?.data?.essentials;
  if (!alert) {
    context.res = { status: 400, body: "no alert payload" };
    return;
  }
  context.log(`Alert ${alert.alertRule} is ${alert.monitorCondition} (${alert.severity})`);
  // Only act on a freshly fired alert, ignore the auto-resolve callback
  if (alert.monitorCondition === "Fired") {
    const target = alert.alertTargetIDs?.[0];
    context.log(`Remediating ${target}`);
    // ... call ARM / Az SDK to restart/scale the resource ...
  }
  context.res = { status: 202, body: "accepted" };
};
The two non-negotiables for automation handlers:
- Idempotency. Alerts can fire, resolve, and re-fire; an action group may retry on a non-2xx. Your handler must tolerate being invoked twice for the same incident without doubling the action.
- Fast ack, async work. Return 202 quickly and push slow remediation onto a queue. A webhook that blocks for 90 seconds will be retried, producing duplicate work.
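Both rules can be sketched in a few lines. The example below is a minimal illustration, not a production handler: the in-memory Set and array stand in for a durable store and a real queue (both assumptions of this sketch), and handleAlert and dedupeKey are names invented here. The alertId, monitorCondition, firedDateTime, and alertTargetIDs fields come from the common alert schema essentials.

```javascript
// Sketch of an idempotent, fast-acknowledging handler for common-schema alerts.
// The in-memory Set and array stand in for a durable store and a real queue;
// both are assumptions of this example.
const seen = new Set();
const workQueue = [];

function dedupeKey(essentials) {
  // Same alert instance + same state transition => same key.
  return `${essentials.alertId}|${essentials.monitorCondition}|${essentials.firedDateTime ?? ""}`;
}

function handleAlert(body) {
  const essentials = body?.data?.essentials;
  if (!essentials) return { status: 400, body: "no alert payload" };

  const key = dedupeKey(essentials);
  if (!seen.has(key)) {
    seen.add(key);
    // Defer the slow part: enqueue the work and return immediately.
    workQueue.push({
      key,
      condition: essentials.monitorCondition,
      targets: essentials.alertTargetIDs ?? [],
    });
  }
  // A retried delivery gets the same 202 but does not enqueue twice.
  return { status: 202, body: "accepted" };
}
```

Keying on the state transition as well as the alert ID means a Resolved callback is treated as new work while a retried Fired delivery is not, which is the behaviour the two rules above ask for.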
For richer orchestration – approvals, multi-step runbooks, ServiceNow tickets – a Logic App is the better target: enable the common alert schema on the action, and the trigger body is the same well-known structure, no custom parsing.
Verify
Walk the full chain before declaring victory:
- Data is landing. After associating a DCR, query the target table (Perf | take 10, Syslog | take 10) and confirm rows with a recent TimeGenerated. No rows means a bad association or a transform that dropped TimeGenerated.
- The transform actually fired. Compare row counts before and after enabling transformKql. The volume drop should match the filter. Confirm _IsBillable and ingested GB in Usage reflect the reduction.
- Table plans are set. Run az monitor log-analytics workspace table show and confirm the plan, retentionInDays, and totalRetentionInDays on each high-volume table match intent.
- Workbook parameters cascade. Open the workbook, change the time range and subscription pickers, and confirm every step re-queries. A step that ignores {TimeRange} is hardcoded – fix it.
- A metric alert fires and resolves. Drive a test load (or lower the threshold temporarily) and confirm the alert moves to Fired, the action group notifies, and it auto-resolves when the condition clears.
- A log alert respects dimensions. Generate errors on two machines and confirm you get two independent alerts grouped by Computer, each resolving on its own.
- Suppression works. Inside the alert-processing-rule window, fire a matching-severity alert and confirm no notification is sent while the alert still records in the portal.
- Automation is idempotent. Invoke the Function/Logic App twice with the same payload and confirm a single net effect.
Enterprise scenario
A payments platform team ran ~1,400 VMs plus AKS across three regions and watched their Log Analytics spend cross a six-figure annual run rate. The forensic finding: a single chatty Syslog stream and a verbose application diagnostic table accounted for roughly 70% of ingested volume, and almost none of it was ever queried – it existed because the old MMA config collected everything and nobody had revisited it after the migration to AMA. The constraint was hard: the security team had a regulatory requirement to retain authentication and audit events for one year, so “just stop collecting” was off the table for that slice.
They solved it with the collection plane, not the bill. First, an ingestion-time transformation on the syslog data flow dropped Information-level rows and noise from cron/health-probe processes, which alone cut syslog volume by more than half at zero query-experience cost. Second, the verbose app table was moved to the Basic plan with 30 days interactive and 365 days total retention – the rare incident query could still reach it, but ingest cost per GB fell sharply. Third, the regulated auth/audit events were split into their own table kept on the Analytics plan so security’s alerts and one-year retention were untouched. The net was a ~45% reduction in monthly ingestion cost with no loss of any signal anyone actually used, and every change shipped as a reviewed DCR/table change in the platform repo. The load-bearing change was a few lines of KQL on a data flow:
source
| where SeverityLevel !in ("info", "debug", "notice")
| where ProcessName !in ("CRON", "systemd", "kubelet-health")
| project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage
The lesson the team took away: in Azure Monitor, cost is a collection design decision, not a billing surprise. Once the DCR was the unit of intent, “what do we collect and what does it cost” became a pull request instead of a quarterly autopsy.