Azure Observability

Azure Monitor End to End: Data Collection Rules, Workbooks, Metric/Log Alerts, and Action Group Automation

Most “we have Azure Monitor” stories fall apart under two questions: what exactly are you collecting, and what is it costing you per GB per month? The answer is usually a shrug, a legacy MMA agent nobody dares remove, and a Log Analytics bill that grew 40% last quarter with no new workloads. The modern stack fixes this by making collection an explicit, versioned artifact – a Data Collection Rule (DCR) – and by letting you drop or reshape data before you pay to ingest it. This piece builds the whole chain as code: DCRs and endpoints feeding the Azure Monitor Agent, ingestion-time transformations that trim cost, a workspace and table design that matches your retention economics, workbooks that turn KQL into something an on-call engineer can actually read, metric and log alerts that scale across resources, and action groups that hand off to automation instead of paging a human at 3am.

Mental model. Azure Monitor has a collection plane and a signal plane. The collection plane (DCRs, DCEs, the Azure Monitor Agent, the Logs Ingestion API) decides what lands in a table and in what shape. The signal plane (metric alerts, scheduled-query alerts, action groups, alert processing rules) decides what humans and automation hear about. Teams that conflate the two end up alerting on raw firehose data they should have filtered at ingestion. Filter low, alert high.

1. Data Collection Rules, endpoints, and the Azure Monitor Agent

The legacy Log Analytics agent (MMA/OMS) is retired as of 31 August 2024. The replacement is the Azure Monitor Agent (AMA), and AMA does nothing on its own – it is driven entirely by Data Collection Rules associated to a machine. A DCR is a first-class ARM resource that declares three things: dataSources (what to read – perf counters, syslog facilities, Windows event logs), destinations (where to send – one or more Log Analytics workspaces), and dataFlows (which source maps to which destination table, and an optional transformation).

A Data Collection Endpoint (DCE) is the ingestion entry point. You need an explicit DCE when you use the Logs Ingestion API (custom logs pushed over REST) or when you require Private Link for ingestion via an Azure Monitor Private Link Scope (AMPLS). For plain AMA collection over public networking, a DCE is optional, but standardising on one keeps Private Link a config change rather than a re-architecture.

Register providers and create the endpoint:

az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.OperationalInsights

# Data Collection Endpoint -- the ingestion entry point
az monitor data-collection endpoint create \
  --name dce-platform-eastus \
  --resource-group rg-observability \
  --location eastus \
  --public-network-access Enabled

Now the DCR. This one collects a focused set of Linux perf counters and syslog and sends them to a workspace. Note the stream names: built-in tables use Microsoft-Perf, Microsoft-Syslog, Microsoft-Event, and so on.

{
  "location": "eastus",
  "properties": {
    "dataCollectionEndpointId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionEndpoints/dce-platform-eastus",
    "dataSources": {
      "performanceCounters": [
        {
          "name": "perf-core",
          "streams": ["Microsoft-Perf"],
          "samplingFrequencyInSeconds": 60,
          "counterSpecifiers": [
            "\\Processor(_Total)\\% Processor Time",
            "\\Memory\\Available MBytes",
            "\\LogicalDisk(_Total)\\% Free Space"
          ]
        }
      ],
      "syslog": [
        {
          "name": "syslog-warn",
          "streams": ["Microsoft-Syslog"],
          "facilityNames": ["auth", "daemon", "syslog"],
          "logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"]
        }
      ]
    },
    "destinations": {
      "logAnalytics": [
        {
          "name": "la-platform",
          "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform"
        }
      ]
    },
    "dataFlows": [
      { "streams": ["Microsoft-Perf"],   "destinations": ["la-platform"] },
      { "streams": ["Microsoft-Syslog"], "destinations": ["la-platform"] }
    ]
  }
}
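Because dataFlows refer to streams and destinations only by name, a typo deploys cleanly and then collects nothing. A minimal lint sketch, assuming an agent-based DCR shaped like the one above, can catch dangling references before deployment. The function name lintDcr and the exampleDcr object are illustrative and not part of any Azure SDK. The check covers internal references only and does not handle streamDeclarations for custom logs.

```javascript
// Sketch: lint a DCR document for dangling stream/destination references.
// lintDcr is a hypothetical helper; it does not validate against the ARM schema.
function lintDcr(dcr) {
  const props = dcr.properties ?? {};
  const problems = [];

  // Streams declared by any data source (performanceCounters, syslog, ...).
  const declaredStreams = new Set(
    Object.values(props.dataSources ?? {})
      .flat()
      .flatMap((source) => source.streams ?? [])
  );

  // Destination names across all destination kinds (logAnalytics, ...).
  const destinationNames = new Set(
    Object.values(props.destinations ?? {})
      .flat()
      .map((dest) => dest.name)
  );

  const routedStreams = new Set();
  for (const flow of props.dataFlows ?? []) {
    for (const stream of flow.streams ?? []) {
      routedStreams.add(stream);
      if (!declaredStreams.has(stream)) {
        problems.push(`dataFlow references undeclared stream ${stream}`);
      }
    }
    for (const dest of flow.destinations ?? []) {
      if (!destinationNames.has(dest)) {
        problems.push(`dataFlow references unknown destination ${dest}`);
      }
    }
  }

  // A stream that is collected but never routed lands nowhere.
  for (const stream of declaredStreams) {
    if (!routedStreams.has(stream)) {
      problems.push(`stream ${stream} is collected but has no dataFlow`);
    }
  }
  return problems;
}

// Illustrative input: the flow points at a destination name that does not exist.
const exampleDcr = {
  properties: {
    dataSources: { syslog: [{ name: "syslog-warn", streams: ["Microsoft-Syslog"] }] },
    destinations: { logAnalytics: [{ name: "la-platform" }] },
    dataFlows: [{ streams: ["Microsoft-Syslog"], destinations: ["la-plaform"] }],
  },
};
console.log(lintDcr(exampleDcr));
```

Running a check like this in the pull-request pipeline keeps the DCR file reviewable as code rather than discovered as missing data.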

Create it and associate machines. Association is what actually arms the agent:

az monitor data-collection rule create \
  --name dcr-linux-platform \
  --resource-group rg-observability \
  --location eastus \
  --rule-file ./dcr-linux-platform.json

# Bind the DCR to a VM (repeat per machine, or drive via Policy at scale)
az monitor data-collection rule association create \
  --name dcra-vm-app-01 \
  --rule-id "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionRules/dcr-linux-platform" \
  --resource "/subscriptions/<sub>/resourceGroups/rg-fleet/providers/Microsoft.Compute/virtualMachines/vm-app-01"

At fleet scale you never run that association by hand. Use the built-in Azure Policy initiative that installs AMA and creates the association from a DCR parameter, assigned at a management-group scope with a DeployIfNotExists effect and a managed identity for remediation. One machine or ten thousand, the same DCR is the unit of intent.

2. Ingestion-time transformations and KQL filtering for cost control

This is the highest-leverage feature in the whole platform and the one most teams have never enabled. A transformation is a KQL snippet attached to a dataFlow that runs at ingestion time, before data is billed and stored. You can drop rows, drop columns, redact PII, and project new computed fields. Because billing is on ingested volume, a transformation that filters 60% of chatty Information-level syslog is a direct, permanent line-item reduction.

The transform operates on a pipeline variable named source and must project the columns that match the destination table’s schema. Add a transformKql to the relevant data flow:

"dataFlows": [
  {
    "streams": ["Microsoft-Syslog"],
    "destinations": ["la-platform"],
    "transformKql": "source | where SeverityLevel != 'info' | where ProcessName !in ('CRON','sudo') | project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage"
  }
]

A few rules that bite people:

For custom logs over the Logs Ingestion API, the transform is even more powerful because you control the input shape. A common pattern is to send fat JSON and let the transform split it into a normal column and a DynamicJson blob, or to compute a severity from a free-text message:

source
| extend Severity = case(
    Message has_cs "ERROR", "Error",
    Message has_cs "WARN",  "Warning",
    "Information")
| where Severity != "Information"
| project TimeGenerated = todatetime(EventTime), Computer, Severity, Message

Cost rule of thumb. Filter at ingestion for volume you will never query (debug chatter, health-probe 200s). Use a cheaper table plan (next section) for volume you query rarely but must retain. Never solve a cost problem by turning off collection you will wish you had during an incident.
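To put a number on a filter before shipping it, you can replay the same predicate over a sample of rows and scale the kept fraction against your ingestion volume. The sketch below mirrors the syslog filter shown earlier. The sample rows, the daily volume, and the per-GB price are all hypothetical inputs, and the estimate assumes rows are roughly equal in size.

```javascript
// Hypothetical sample rows using the Syslog columns referenced by the transform.
const sampleRows = [
  { SeverityLevel: "info", ProcessName: "sshd" },
  { SeverityLevel: "info", ProcessName: "CRON" },
  { SeverityLevel: "warning", ProcessName: "CRON" },
  { SeverityLevel: "err", ProcessName: "nginx" },
  { SeverityLevel: "warning", ProcessName: "sudo" },
];

// JavaScript mirror of:
//   where SeverityLevel != 'info' | where ProcessName !in ('CRON','sudo')
function keepRow(row) {
  return row.SeverityLevel !== "info" && !["CRON", "sudo"].includes(row.ProcessName);
}

// Estimate monthly cost before and after the filter.
// dailyGb and pricePerGb are placeholders; substitute your own figures.
function estimateSavings(rows, dailyGb, pricePerGb) {
  const kept = rows.filter(keepRow).length;
  const keptFraction = rows.length === 0 ? 1 : kept / rows.length;
  const monthlyBefore = dailyGb * 30 * pricePerGb;
  const monthlyAfter = monthlyBefore * keptFraction;
  return { keptFraction, monthlyBefore, monthlyAfter, saved: monthlyBefore - monthlyAfter };
}

console.log(estimateSavings(sampleRows, 50, 2.5));
```

A sample taken from a real day of the table gives a far better estimate than a hand-built one. The point is only that the kept fraction translates directly into the ingestion line item.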

3. Log Analytics workspace design, tables, and table-level plans

Two workspace decisions dominate the bill: how many workspaces you run, and the table plan on each table. The modern guidance is few workspaces, many tables, per-table plans – one regional platform workspace per major boundary rather than a workspace per team, because cross-workspace KQL (workspace()/union) is awkward and access control is now solvable at the table and row level.

Azure Monitor Logs offers three table plans:

Plan | Use for | Query | Retention model
Analytics | Hot, frequently queried signals (alerts, dashboards) | Full KQL, fast | Interactive retention (up to long term)
Basic | High-volume, occasionally queried logs (verbose app/network logs) | KQL subset, per-query billed | Short interactive + long-term archive
Auxiliary | Very high-volume, low-fidelity (raw audit, verbose firewall) | Limited KQL, lowest ingest cost | Long-term, cheapest ingest

Set retention with two dials: interactive retention (queryable without restore) and total retention (interactive + cheap long-term archive). Alert rules and dashboards must read from interactive retention; archived data needs a search job or restore first.
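The two dials can be read as a simple decision on data age. The sketch below is an illustration of that rule, not an Azure API. The field names retentionInDays and totalRetentionInDays follow the table properties, and the table object and function names are hypothetical.

```javascript
// Which path serves data of a given age, based on the two retention dials.
function dataAccessPath(ageDays, table) {
  if (ageDays <= table.retentionInDays) return "interactive";
  if (ageDays <= table.totalRetentionInDays) return "search-job-or-restore";
  return "expired";
}

// An alert or dashboard query must fit inside interactive retention.
function alertWindowFits(windowDays, table) {
  return windowDays <= table.retentionInDays;
}

// Hypothetical table: 30 days interactive, 365 days total.
const appVerbose = { name: "AppVerbose_CL", retentionInDays: 30, totalRetentionInDays: 365 };

for (const age of [7, 90, 400]) {
  console.log(`${age} days old: ${dataAccessPath(age, appVerbose)}`);
}
```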

# Create the workspace
az monitor log-analytics workspace create \
  --resource-group rg-observability \
  --workspace-name law-platform \
  --location eastus \
  --retention-time 90

# Move a chatty custom table to the Basic plan and set retention split
az monitor log-analytics workspace table update \
  --resource-group rg-observability \
  --workspace-name law-platform \
  --name AppVerbose_CL \
  --plan Basic \
  --retention-time 30 \
  --total-retention-time 365

Pair this with table-level RBAC so an app team sees its own *_CL tables but not the platform security tables, instead of minting a workspace per team just to scope access.

4. Workbooks: parameters, queries, and reusable visual templates

A workbook is a JSON template (an ARM resource of type Microsoft.Insights/workbooks) that combines parameters, KQL query steps, text, and visualisations. The feature that makes them reusable – rather than a screenshot with extra steps – is parameters: a parameter is itself usually a KQL query, and downstream steps interpolate it with {ParamName}.

The pattern that scales: a top-of-workbook time-range parameter plus a resource/subscription picker, then every query references both. Here is the parameter-and-query skeleton inside the workbook items array:

{
  "type": 9,
  "content": {
    "parameters": [
      {
        "name": "TimeRange",
        "type": 4,
        "isRequired": true,
        "value": { "durationMs": 3600000 }
      },
      {
        "name": "Subscription",
        "type": 6,
        "query": "summarize by subscriptionId",
        "queryType": 1,
        "crossComponentResources": ["value::all"]
      }
    ]
  }
}

A query step that consumes them. Note that {TimeRange} expands into the comparison half of the clause (the query supplies where TimeGenerated itself), and the time-brush feeds the chart automatically:

Perf
| where TimeGenerated {TimeRange}
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart
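Conceptually, parameter interpolation is plain text substitution into the query before it runs. The sketch below is a simplified stand-in for that mechanism, not the workbook engine itself. The expansion value "> ago(1h)" and the helper names are assumptions for illustration. The second helper flags steps that never reference a required parameter, which is the usual sign of a hardcoded time range.

```javascript
// Replace {Name} tokens with parameter values; fail loudly on unknown names.
function interpolate(template, params) {
  return template.replace(/\{(\w+)\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(params, name)) {
      throw new Error(`unresolved parameter ${name}`);
    }
    return params[name];
  });
}

// Return the names of steps whose query text never mentions {requiredParam}.
function findHardcodedSteps(steps, requiredParam) {
  return steps
    .filter((step) => !step.query.includes(`{${requiredParam}}`))
    .map((step) => step.name);
}

// Hypothetical steps taken from a workbook's items array.
const steps = [
  { name: "cpu", query: "Perf | where TimeGenerated {TimeRange} | take 10" },
  { name: "errors", query: "Syslog | where TimeGenerated > ago(1d) | take 10" },
];

console.log(interpolate(steps[0].query, { TimeRange: "> ago(1h)" }));
console.log(findHardcodedSteps(steps, "TimeRange"));
```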

Two practices keep workbooks maintainable. First, pin parameter queryType and crossComponentResources so the same template works whether it is scoped to one resource or an entire subscription. Second, keep the workbook JSON in source control and deploy it as a shared workbook via Bicep so every team gets the same “service health” workbook rather than forking ten copies:

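// Parameters the resource below depends on
param location string = resourceGroup().location
param workspaceResourceId string

resource wb 'Microsoft.Insights/workbooks@2023-06-01' = {
  // Workbook names must be GUIDs; guid() keeps the name stable across deployments
  name: guid('platform-health-workbook')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Platform Health'
    category: 'workbook'
    sourceId: workspaceResourceId
    serializedData: loadTextContent('./workbooks/platform-health.json')
  }
}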

5. Metric alerts, dynamic thresholds, and multi-resource scoping

Metric alerts evaluate platform metrics (or custom metrics) on a near-real-time, pre-aggregated stream – they are cheap, fast, and stateful. Two capabilities make them scale. Multi-resource scope lets one alert rule watch every VM in a resource group or subscription of the same type, so you author one rule instead of one-per-VM. Dynamic thresholds replace a hand-picked number with a machine-learned band over the metric’s history, which is the only sane choice for metrics whose “normal” varies by time of day.

A static, multi-resource CPU alert over an entire resource group:

az monitor metrics alert create \
  --name "vm-cpu-high" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --target-resource-type "Microsoft.Compute/virtualMachines" \
  --target-resource-region eastus \
  --condition "avg Percentage CPU > 85" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

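For dynamic thresholds the condition replaces the fixed number with the dynamic keyword, a sensitivity, and a violation count. The >< operator fires when the metric leaves the learned band in either direction (4 violations out of 4 periods is far less noisy than 1 of 1):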

az monitor metrics alert create \
  --name "vm-cpu-dynamic" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --target-resource-type "Microsoft.Compute/virtualMachines" \
  --target-resource-region eastus \
  --condition "avg Percentage CPU >< dynamic medium 4 of 4" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
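The "4 of 4" part of the condition is easy to reason about in isolation. The sketch below shows only the N-of-M violation count against a fixed threshold. It does not model the learned band of a dynamic threshold, and the function name and sample values are illustrative.

```javascript
// Fire only when at least `violationsNeeded` of the last `periods` samples breach.
function shouldFire(samples, threshold, violationsNeeded, periods) {
  const recent = samples.slice(-periods);
  if (recent.length < periods) return false; // not enough history yet
  const violations = recent.filter((value) => value > threshold).length;
  return violations >= violationsNeeded;
}

// Hypothetical CPU averages, one per evaluation period.
const spiky = [50, 90, 60, 95];
const sustained = [90, 91, 92, 93];

console.log(shouldFire(spiky, 85, 1, 1));     // single-sample rule reacts to the last spike
console.log(shouldFire(spiky, 85, 4, 4));     // 4 of 4 ignores intermittent spikes
console.log(shouldFire(sustained, 85, 4, 4)); // sustained breach satisfies 4 of 4
```

The trade-off is detection delay: a 4 of 4 rule at a 5-minute frequency cannot fire until the breach has lasted four consecutive evaluations.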

Auto-mitigation matters. Metric alerts are stateful: a fired alert auto-resolves when the condition clears (default behaviour), and the action group is notified of resolved as well as fired. Do not build alert logic that assumes you must manually close alerts – wire your downstream automation to handle the resolved signal too.

6. Scheduled query (log) alerts and stateful alert processing

When the signal lives in logs rather than a metric – “more than 20 5xx responses from one pod in 5 minutes,” “a privileged role was assigned” – you need a scheduled query rule (Microsoft.Insights/scheduledQueryRules, API version 2021-08-01 and later, sometimes called Log Alerts v2). It runs KQL on a schedule, compares an aggregated result to a threshold, and fires.

The two settings that separate a good log alert from an alert storm are stateful alerts (autoMitigate) and dimensions. Dimensions split one rule into one alert per value of a grouping column – so a rule grouped by Computer fires a separate, independently-resolving alert per machine, instead of one giant alert that flaps.

# Dimensions are expressed inside --condition: "where Computer includes *" splits the rule per machine
az monitor scheduled-query create \
  --name "syslog-error-burst" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform" \
  --condition "count 'errs' > 20 where Computer includes *" \
  --condition-query errs='Syslog | where SeverityLevel in ("err","crit","alert","emerg")' \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --auto-mitigate true \
  --action-groups "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

Three principal-level rules for log alerts:

  1. Aggregate inside the query, not in your head. The rule compares a single aggregated number per dimension to the threshold. summarize count() by Computer, bin(TimeGenerated, 5m) keeps the evaluation deterministic.
  2. Keep evaluation-frequency >= the data latency. Log ingestion has minutes of latency; evaluating every 1 minute against data that arrives every 3 produces false negatives and duplicate fires. Match frequency to reality.
  3. Read from interactive retention only. Alert queries cannot reach archived (long-term) data without a restore. If a table is on the Basic/Auxiliary plan with short interactive retention, your alert window must fit inside it.
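The effect of dimensions plus stateful evaluation can be sketched outside Azure. The example below groups one evaluation window by a dimension column, compares each group's count to the threshold, and emits Fired or Resolved events only on state changes. It is a simplified model: the real service applies its own resolution rules, and the function name and sample rows are hypothetical.

```javascript
// Evaluate one window of rows, split by `dimension`, against a count threshold.
// previouslyFiring carries state between evaluations so events fire only on change.
function evaluateByDimension(rows, dimension, threshold, previouslyFiring = new Set()) {
  const counts = new Map();
  for (const row of rows) {
    const key = row[dimension];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const firing = new Set();
  const events = [];
  for (const [key, count] of counts) {
    if (count > threshold) {
      firing.add(key);
      if (!previouslyFiring.has(key)) {
        events.push({ [dimension]: key, state: "Fired", count });
      }
    }
  }
  // Anything that was firing and no longer breaches resolves independently.
  for (const key of previouslyFiring) {
    if (!firing.has(key)) {
      events.push({ [dimension]: key, state: "Resolved", count: counts.get(key) ?? 0 });
    }
  }
  return { firing, events };
}

// Window 1: vm-a is noisy. Window 2: vm-a is quiet, vm-b is noisy.
const window1 = [{ Computer: "vm-a" }, { Computer: "vm-a" }, { Computer: "vm-a" }, { Computer: "vm-b" }];
const window2 = [{ Computer: "vm-b" }, { Computer: "vm-b" }, { Computer: "vm-b" }];

const first = evaluateByDimension(window1, "Computer", 2);
const second = evaluateByDimension(window2, "Computer", 2, first.firing);
console.log(first.events, second.events);
```

Without the dimension, both windows would exceed the same total and the rule would look like one continuous alert, hiding the fact that the problem moved from one machine to another.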

7. Action groups, alert processing rules, and suppression

An action group is the reusable fan-out target: a named bundle of notifications (email, SMS, push, voice) and actions (webhook, Logic App, Function, Automation Runbook, ITSM connector). Every alert type – metric, log, activity log, Service Health – points at the same action group resource, so you manage on-call routing in one place.

# usecommonalertschema gives every receiver the same payload envelope
az monitor action-group create \
  --name ag-platform-oncall \
  --resource-group rg-observability \
  --short-name pltoncall \
  --action email oncall-lead "<oncall-email>" usecommonalertschema \
  --action webhook pagerduty "https://events.pagerduty.com/integration/<key>/enqueue" usecommonalertschema \
  --action logicapp incident-workflow \
    "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident" \
    "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident/triggers/manual/paths/invoke" \
    usecommonalertschema

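The piece teams miss is the alert processing rule (Microsoft.AlertsManagement/actionRules). It sits between alerts and action groups and does two jobs without touching a single alert rule: it can attach an action group to every matching alert in a scope (rule type AddActionGroups), and it can suppress notifications by removing action groups from matching alerts (rule type RemoveAllActionGroups), optionally on a schedule.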

Maintenance-window suppression across a resource group:

az monitor alert-processing-rule create \
  --name "suppress-maint-window" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --rule-type RemoveAllActionGroups \
  --filter-severity Equals Sev2 Sev3 \
  --schedule-recurrence-type Weekly \
  --schedule-recurrence-start-time "02:00:00" \
  --schedule-recurrence-end-time "04:00:00" \
  --schedule-recurrence Sunday \
  --description "Mute Sev2/Sev3 during Sunday patch window"

This is how you keep a noisy estate humane: the alert rules stay armed and the processing layer decides who hears them and when.
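The suppression decision itself is a small predicate: does the alert match the severity filter, and did it fire inside the recurring window? The sketch below models the Sunday rule above for reasoning and testing. It assumes the window is expressed in UTC and uses the severity and firedDateTime fields from the common alert schema. The rule object and function name are hypothetical.

```javascript
// Hypothetical model of the weekly suppression rule, with times in UTC.
const maintenanceRule = {
  severities: ["Sev2", "Sev3"],
  day: 0, // Sunday, as returned by Date.prototype.getUTCDay()
  startHour: 2,
  endHour: 4,
};

// True when the alert's notifications would be muted by the rule.
function isSuppressed(alert, rule) {
  const firedAt = new Date(alert.firedDateTime);
  const hour = firedAt.getUTCHours();
  const inWindow =
    firedAt.getUTCDay() === rule.day && hour >= rule.startHour && hour < rule.endHour;
  return inWindow && rule.severities.includes(alert.severity);
}

// 2024-06-02 is a Sunday.
console.log(isSuppressed({ severity: "Sev2", firedDateTime: "2024-06-02T03:00:00Z" }, maintenanceRule));
console.log(isSuppressed({ severity: "Sev1", firedDateTime: "2024-06-02T03:00:00Z" }, maintenanceRule));
```

Note that a Sev1 alert in the same window still notifies, which is usually what you want from a patch-window mute.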

8. Automation hooks to Logic Apps, Functions, and webhooks

The point of all of the above is to do something without a human. An action group can call a Logic App, an Azure Function, or a raw webhook, passing the alert as JSON. Use the common alert schema so every downstream gets the same envelope regardless of whether a metric or log alert fired – otherwise your Function has to parse three different payload shapes.

A Function that auto-remediates by parsing the common schema and restarting a service (sketch, Node.js):

module.exports = async function (context, req) {
  const alert = req.body?.data?.essentials;
  if (!alert) { context.res = { status: 400, body: "no alert payload" }; return; }

  context.log(`Alert ${alert.alertRule} is ${alert.monitorCondition} (${alert.severity})`);

  // Only act on a freshly fired alert, ignore the auto-resolve callback
  if (alert.monitorCondition === "Fired") {
    const target = alert.alertTargetIDs?.[0];
    context.log(`Remediating ${target}`);
    // ... call ARM / Az SDK to restart/scale the resource ...
  }
  context.res = { status: 202, body: "accepted" };
};

The two non-negotiables for automation handlers:

For richer orchestration – approvals, multi-step runbooks, ServiceNow tickets – a Logic App is the better target: enable the common alert schema on the action, and the trigger body is the same well-known structure, no custom parsing.
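Action groups can deliver the same notification more than once, so a remediation handler should be safe to invoke twice. A minimal sketch of one way to get there is to key each delivery on the alert ID plus its monitor condition and skip repeats. The in-memory Set is an assumption for illustration only. A real Function would need durable storage shared across instances, and handleAlert and the sample payload are hypothetical.

```javascript
// In-memory dedupe store; replace with durable storage in a real handler.
const processed = new Set();

// Handle a common-alert-schema payload at most once per (alertId, monitorCondition).
function handleAlert(payload, remediate) {
  const essentials = payload?.data?.essentials;
  if (!essentials) return { status: 400, body: "no alert payload" };

  const key = `${essentials.alertId}:${essentials.monitorCondition}`;
  if (processed.has(key)) return { status: 200, body: "duplicate ignored" };
  processed.add(key);

  // Only a freshly fired alert triggers remediation.
  if (essentials.monitorCondition === "Fired") {
    remediate(essentials.alertTargetIDs?.[0]);
  }
  return { status: 202, body: "accepted" };
}

// Illustrative payload; the alertId and target values are made up.
const samplePayload = {
  data: {
    essentials: {
      alertId: "example-alert-id-1",
      monitorCondition: "Fired",
      alertTargetIDs: ["example-target-resource-id"],
    },
  },
};

const restarted = [];
console.log(handleAlert(samplePayload, (target) => restarted.push(target)));
console.log(handleAlert(samplePayload, (target) => restarted.push(target)));
console.log(restarted.length);
```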

Verify

Walk the full chain before declaring victory:

  1. Data is landing. After associating a DCR, query the target table (Perf | take 10, Syslog | take 10) and confirm rows with a recent TimeGenerated. No rows means a bad association or a transform that dropped TimeGenerated.
  2. The transform actually fired. Compare row counts before and after enabling transformKql. The volume drop should match the filter. Confirm _IsBillable and ingested GB in Usage reflect the reduction.
  3. Table plans are set. Run az monitor log-analytics workspace table show and confirm the plan, retentionInDays, and totalRetentionInDays on each high-volume table match intent.
  4. Workbook parameters cascade. Open the workbook, change the time range and subscription pickers, and confirm every step re-queries. A step that ignores {TimeRange} is hardcoded – fix it.
  5. A metric alert fires and resolves. Drive a test load (or lower the threshold temporarily) and confirm the alert moves to Fired, the action group notifies, and it auto-resolves when the condition clears.
  6. A log alert respects dimensions. Generate errors on two machines and confirm you get two independent alerts grouped by Computer, each resolving on its own.
  7. Suppression works. Inside the alert-processing-rule window, fire a matching-severity alert and confirm no notification is sent while the alert still records in the portal.
  8. Automation is idempotent. Invoke the Function/Logic App twice with the same payload and confirm a single net effect.

Enterprise scenario

A payments platform team ran ~1,400 VMs plus AKS across three regions and watched their Log Analytics spend cross a six-figure annual run rate. The forensic finding: a single chatty Syslog stream and a verbose application diagnostic table accounted for roughly 70% of ingested volume, and almost none of it was ever queried – it existed because the old MMA config collected everything and nobody had revisited it after the migration to AMA. The constraint was hard: the security team had a regulatory requirement to retain authentication and audit events for one year, so “just stop collecting” was off the table for that slice.

They solved it with the collection plane, not the bill. First, an ingestion-time transformation on the syslog data flow dropped Information-level rows and noise from cron/health-probe processes, which alone cut syslog volume by more than half at zero query-experience cost. Second, the verbose app table was moved to the Basic plan with 30 days interactive and 365 days total retention – the rare incident query could still reach it, but ingest cost per GB fell sharply. Third, the regulated auth/audit events were split into their own table kept on the Analytics plan so security’s alerts and one-year retention were untouched. The net was a ~45% reduction in monthly ingestion cost with no loss of any signal anyone actually used, and every change shipped as a reviewed DCR/table change in the platform repo. The load-bearing change was a few lines of KQL on a data flow:

source
| where SeverityLevel !in ("info", "debug", "notice")
| where ProcessName !in ("CRON", "systemd", "kubelet-health")
| project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage

The lesson the team took away: in Azure Monitor, cost is a collection design decision, not a billing surprise. Once the DCR was the unit of intent, “what do we collect and what does it cost” became a pull request instead of a quarterly autopsy.

Checklist
