
Defender XDR Advanced Hunting: Custom Detection Rules and Automatic Attack Disruption

A SIEM is only as good as the questions you ask it. Defender XDR’s advantage over a raw log lake is that endpoint, identity, email, and cloud-app telemetry are already normalized into one schema and pre-correlated into incidents. Advanced hunting is where you turn that schema into detection engineering: write a cross-domain KQL query that follows an attacker across the kill chain, promote it to a scheduled custom detection that takes its own response actions, and let automatic attack disruption contain the worst-case scenarios faster than any human SOC can. This guide is the workflow I run with platform and security teams to get from “interesting query” to “rule that isolates a host at 02:00 without paging anyone.”

Everything here assumes Defender for Endpoint Plan 2 (included in Microsoft 365 E5 / E5 Security). Advanced hunting and custom detections do not exist on Plan 1.

1. The unified advanced hunting schema

The single biggest reason to hunt in Defender XDR rather than bare Sentinel is that the workloads share a schema. You pivot from a process to the user who ran it to the email that delivered the payload to the SaaS app they then logged into, all without normalizing disparate log sources before you can join them. The tables you will live in:

Domain | Core tables | Backed by
Device | DeviceProcessEvents, DeviceNetworkEvents, DeviceFileEvents, DeviceLogonEvents, DeviceRegistryEvents | Defender for Endpoint
Identity | IdentityLogonEvents, IdentityDirectoryEvents, IdentityInfo, IdentityQueryEvents | Defender for Identity
Email | EmailEvents, EmailAttachmentInfo, EmailUrlInfo, EmailPostDeliveryEvents, UrlClickEvents | Defender for Office 365
Cloud apps | CloudAppEvents, OAuthAppInfo | Defender for Cloud Apps
Correlation | AlertInfo, AlertEvidence | All workloads (the join key)

IdentityLogonEvents is the one people misread: it carries both on-prem AD auth (from Defender for Identity sensors) and sign-ins to Microsoft online services seen via Defender for Cloud Apps. The AlertEvidence table is the connective tissue. It lists every file, IP, URL, user, device, and mailbox that a Defender alert touched, keyed by AlertId, which lets you start from a known-bad alert and fan out to everything related.

Open the hunting console at security.microsoft.com -> Hunting -> Advanced hunting. A first orientation query — what’s actually flowing, and at what volume:

union withsource=TableName DeviceProcessEvents, IdentityLogonEvents, EmailEvents, CloudAppEvents
| where Timestamp > ago(1h)
| summarize Events = count() by TableName
| sort by Events desc

That row count tells you which tables are cheap to scan and which need tight filters — CloudAppEvents and DeviceProcessEvents are usually your highest-volume tables and the ones that will time a query out if you are sloppy.
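To make the AlertEvidence fan-out described above concrete, here is a minimal sketch that starts from recent high-severity alerts and lists the entities each one touched. The 7-day window, the severity filter, and the set sizes are illustrative assumptions, not recommended values.

```kql
// Sketch: fan out from recent High alerts to the entities they touched.
// Window, severity filter, and set sizes are illustrative assumptions.
AlertInfo
| where Timestamp > ago(7d)
| where Severity == "High"
| project AlertId, AlertTitle = Title, ServiceSource
| join kind=inner (
    AlertEvidence
    | where Timestamp > ago(7d)
    | project AlertId, EntityType, DeviceName, AccountUpn, RemoteIP
  ) on AlertId
| summarize EvidenceRows = count(),
            EntityTypes = make_set(EntityType, 10),
            Devices = make_set_if(DeviceName, isnotempty(DeviceName), 25),
            Accounts = make_set_if(AccountUpn, isnotempty(AccountUpn), 25),
            IPs = make_set_if(RemoteIP, isnotempty(RemoteIP), 25)
          by AlertId, AlertTitle, ServiceSource
| sort by EvidenceRows desc
```

Any device, account, or IP that shows up here is a candidate pivot into the Device*, Identity*, or Email* tables.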

2. Writing cross-domain correlation queries

A single-table query is a search. A detection follows the kill chain. The pattern that delivers the most value: deliver (email) -> execute (device) -> persist/move (identity). Here is a hunt for a phishing-delivered payload that actually ran, correlating EmailEvents to DeviceProcessEvents through the recipient’s account.

let lookback = 3d;
// Stage 1: inbound mail flagged as malware/phish, or delivered to Junk
let suspiciousMail =
    EmailEvents
    | where Timestamp > ago(lookback)
    | where EmailDirection == "Inbound"
    | where ThreatTypes has_any ("Malware", "Phish")
        or DeliveryAction == "Junked"    // blocked mail never reaches the user, so it cannot be opened
    | project MailTime = Timestamp,
              RecipientEmailAddress = tolower(RecipientEmailAddress),   // join keys are case-sensitive
              SenderFromAddress, NetworkMessageId, Subject;
// Stage 2: script engines spawned by Office apps, run by those recipients shortly after delivery
DeviceProcessEvents
| where Timestamp > ago(lookback)
| where InitiatingProcessFileName in~ ("winword.exe", "excel.exe", "outlook.exe")
| where FileName in~ ("powershell.exe", "cmd.exe", "wscript.exe", "mshta.exe", "rundll32.exe")
| project ProcTime = Timestamp, DeviceName, AccountUpn = tolower(AccountUpn),
          FileName, ProcessCommandLine, InitiatingProcessFileName
| join kind=inner suspiciousMail on $left.AccountUpn == $right.RecipientEmailAddress
| where ProcTime between (MailTime .. (MailTime + 1h))
| project ProcTime, MailTime, DeviceName, AccountUpn, SenderFromAddress,
          Subject, InitiatingProcessFileName, FileName, ProcessCommandLine
| sort by ProcTime desc

The signal is the temporal join: an Office app spawning a scripting engine within an hour of a flagged email to the same user. Neither half is conclusive alone; together they are a high-fidelity “the user opened the lure and it executed.”

A second example — impossible-travel-style credential abuse correlated with on-host activity — leans on the identity tables:

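let window = 1d;
IdentityLogonEvents
| where Timestamp > ago(window)
| where LogonType == "Interactive" or Protocol == "OAuth2"
| where isnotempty(IPAddress) and isnotempty(AccountUpn) and isnotempty(Location)
// Location can be a city or a country/region, so this counts distinct locations, not strictly countries
| summarize Locations = dcount(Location), LocationSet = make_set(Location, 10),
            IPs = make_set(IPAddress, 10), arg_min(Timestamp, Location)
          by AccountUpn, Hour = bin(Timestamp, 1h)    // named bin avoids clashing with arg_min's Timestamp
| where Locations >= 2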

When you find something, do not rebuild context by hand. Select a row and use Go hunt to pivot every entity (account, device, IP) into a fresh query, or Link to incident to attach the result rows as evidence on an existing incident.

3. Tuning queries for performance

Custom detections inherit the performance of the query behind them, and a rule that times out simply does not run. Three rules of thumb that keep hunts fast and within quota.

Filter on Timestamp first, and keep the window honest. The lookback in your query should match the rule’s run frequency. A rule that runs hourly should look back ~1 hour (plus a small overlap buffer), not 7 days — re-scanning a week every hour is wasted quota and duplicate alerts.

Filter before you join, and put the smaller, more-selective table on the left. KQL evaluates the left side of an inner join first. Reduce both sides with where and project before joining so you are not shuffling fat rows.

Aggregate with summarize and cap your sets. make_set() and make_list() accept a max-size argument — use it. Unbounded sets on high-cardinality columns blow up memory.

DeviceNetworkEvents
| where Timestamp > ago(1h)                                  // 1. time filter first
| where RemotePort in (443, 8443)                            // 2. cheap filters next
| where isnotempty(RemoteUrl)
| summarize ConnCount = count(),
            Hosts = make_set(DeviceName, 50)                 // 3. bounded set
          by RemoteUrl, bin(Timestamp, 10m)
| where ConnCount > 100                                      // 4. threshold last
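The network query above has no join, so here is a separate sketch of the second rule of thumb: both sides are filtered and projected down to the columns they need before the join, and the smaller, more selective side (flagged mail) sits on the left. The one-hour window and the set size are illustrative assumptions.

```kql
// Sketch: reduce both sides before joining; small selective side on the left.
EmailEvents
| where Timestamp > ago(1h)
| where EmailDirection == "Inbound" and ThreatTypes has "Phish"
| project NetworkMessageId, RecipientEmailAddress, SenderFromAddress
| join kind=inner (
    EmailUrlInfo
    | where Timestamp > ago(1h)
    | project NetworkMessageId, Url, UrlDomain
  ) on NetworkMessageId
| summarize Messages = dcount(NetworkMessageId),
            Recipients = make_set(RecipientEmailAddress, 50)   // bounded set
          by UrlDomain
| sort by Messages desc
```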

Two hard limits to design around. A custom detection only ever surfaces the first 150 results per run, so a noisy query silently truncates — aggregate or threshold until each run returns well under that. And the rule’s lookback cannot exceed 30 days. Validate timing with summarize count() by bin(Timestamp, 1h) to confirm your window actually contains the events you expect before you schedule anything.
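One way to check the result cap before scheduling is to replay the detection logic over the last day and count how many rows each hourly run would have returned. This is a sketch built on the network query from this section; the 150 threshold is the cap quoted above, and the one-day replay window is an assumption.

```kql
// Sketch: replay a day of data and count results per would-be hourly run.
DeviceNetworkEvents
| where Timestamp > ago(1d)
| where RemotePort in (443, 8443) and isnotempty(RemoteUrl)
| summarize ConnCount = count() by RemoteUrl, bin(Timestamp, 10m)
| where ConnCount > 100
| summarize ResultsPerRun = count() by RunWindow = bin(Timestamp, 1h)
| extend OverCap = ResultsPerRun > 150     // true means that run would have been truncated
| sort by RunWindow asc
```

If any row shows OverCap as true, raise the threshold or aggregate further before promoting the query.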

4. Promoting hunts to custom detection rules

Once a query is reliable and quiet, promote it. From the advanced hunting editor, click Create detection rule. Two non-negotiable requirements:

  1. The query must return Timestamp and ReportId so each alert maps back to the event that raised it (for Defender for Endpoint tables, DeviceId must be in the same row), plus at least one entity column the platform can act on: typically DeviceId, AccountObjectId / AccountSid, RecipientEmailAddress, or FileName + SHA1. No mappable entity, no rule.
  2. The query must reference at least one Defender XDR table for automated response actions to be available. Pure-Sentinel-only queries can alert but cannot isolate a device.
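A hedged sketch of what a rule-ready shape can look like for the Office-spawns-script-engine logic. Microsoft's custom detection documentation asks for Timestamp, ReportId, and DeviceId from the same event on Defender for Endpoint tables; the arg_max pattern below is one way to keep them intact through a summarize. The one-hour window and the set size are assumptions.

```kql
// Sketch: keep Timestamp, ReportId, and an entity column through aggregation.
DeviceProcessEvents
| where Timestamp > ago(1h)
| where InitiatingProcessFileName in~ ("winword.exe", "excel.exe", "outlook.exe")
| where FileName in~ ("powershell.exe", "cmd.exe", "wscript.exe", "mshta.exe", "rundll32.exe")
| summarize (Timestamp, ReportId) = arg_max(Timestamp, ReportId),   // latest event per group
            Launches = count(),
            CommandLines = make_set(ProcessCommandLine, 10)
          by DeviceId, DeviceName, AccountSid
```

Each output row carries one real event's Timestamp and ReportId, a DeviceId for device actions, and an AccountSid for user actions.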

Frequency options and what they cost:

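Frequency | Lookback window applied by the platform | Use for
Continuous (NRT) | streaming, events evaluated as they arrive | Highest-severity, single-table detections
Every hour | last 4 hours | Most production rules
Every 3 hours | last 12 hours | Low-urgency, broad-sweep hunts
Every 12 hours | last 48 hours | Low-urgency, broad-sweep hunts
Every 24 hours | last 30 days | Low-urgency, broad-sweep hunts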

Near-real-time has constraints (no multi-table join in NRT; one event table), so reserve it for tight single-table logic. Then wire response actions — this is the payoff. For our phishing-execution rule, isolate the device and collect forensics:

Create detection rule
  Alert:    Title: "Office app spawned script engine after flagged email"
            Severity: High   Category: Execution
            MITRE techniques: T1566 (Phishing), T1059 (Command and Scripting Interpreter)
  Impacted entities:
            Device:  DeviceId
            Mailbox: RecipientEmailAddress
  Actions on devices:
            [x] Isolate device (Full)
            [x] Collect investigation package
            [x] Run antivirus scan
  Actions on users:
            [x] Mark user as compromised   (sets user risk to High in Microsoft Entra ID)
  Frequency: Every hour

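Available response actions, by entity:

Entity | Actions
Device | Isolate device, Collect investigation package, Run antivirus scan, Initiate investigation, Restrict app execution
File | Allow/Block, Quarantine file
User | Mark user as compromised, Disable user, Force password reset
Email | Move to mailbox folder, Delete email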

Start every new rule with no automated actions, severity Informational, running for a week. Read the alert volume, tune the false positives out, then attach Isolate. A rule that auto-isolates on a bad assumption is an outage you wrote yourself.
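Before attaching Isolate, it helps to put a number on the blast radius. A minimal sketch, assuming the Office-spawns-script-engine logic from section 2 and a one-week replay window, that counts how many distinct devices the rule would have isolated per day:

```kql
// Sketch: how many devices would this rule have isolated each day last week?
DeviceProcessEvents
| where Timestamp > ago(7d)
| where InitiatingProcessFileName in~ ("winword.exe", "excel.exe", "outlook.exe")
| where FileName in~ ("powershell.exe", "cmd.exe", "wscript.exe", "mshta.exe", "rundll32.exe")
| summarize Hits = count(),
            DevicesThatWouldIsolate = dcount(DeviceId),
            SampleDevices = make_set(DeviceName, 20)
          by Day = bin(Timestamp, 1d)
| sort by Day asc
```

If the daily device count is higher than the number of isolations your help desk could absorb, the rule is not ready for automated actions.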

5. Configuring automatic attack disruption

Custom detections are your logic. Automatic attack disruption is Microsoft’s — a built-in capability that correlates millions of signals across endpoint, identity, email, and SaaS into a single high-confidence incident, identifies the assets the attacker controls, and contains them in real time, independent of your AIR settings. It targets the scenarios where minutes matter: human-operated ransomware, business email compromise (BEC), and adversary-in-the-middle (AiTM).

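You do not write the detections; you enable the org-wide response surface and let them fire. The two automated actions it takes:

Contain device (Defender for Endpoint): other onboarded devices stop communicating with the compromised machine.
Disable user (Defender for Identity): the compromised account is suspended so the attacker cannot keep using it.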

Prerequisites that actually gate it: Defender for Endpoint devices in block mode with automated investigation enabled, Defender for Identity deployed with the action account configured, and the relevant workloads (Office 365, Cloud Apps) connected. Confirm and tune in Settings -> Microsoft Defender XDR -> Automatic attack disruption, where you can scope automated response exclusions for sensitive assets (a domain controller you never want auto-contained, a service account you cannot afford to disable).

Disrupted incidents are labelled so the SOC can see the machine acted:

Incident title: BEC financial fraud attack launched from a compromised account (attack disruption)
Status:         Active
Tags:           Attack disruption
Actions taken:  User <upn> disabled (Defender for Identity)
                Device <hostname> contained (Defender for Endpoint)

When an action lands, the analyst’s job is to validate and release (undo containment / re-enable the user) once remediated — the platform does not auto-rollback.

6. Managing AIR and approval levels

Automated investigation and remediation (AIR) is the layer between “alert raised” and “human triages.” When an alert fires, AIR launches an investigation, walks the related entities, reaches verdicts (Malicious / Suspicious / No threat), and proposes remediation. Whether those remediations execute automatically depends on the device group’s automation level, set in Settings -> Endpoints -> Device groups.

Automation level | Behavior
Full | Remediate automatically, no approval (recommended by Microsoft)
Semi (require approval for all folders) | Every remediation waits for an analyst
Semi (core folders) | Auto-remediate non-core; approve actions in OS folders
No automated response | Investigate only; remediation is manual

The mature posture is Full automation on standard workstation groups (Microsoft’s data shows it remediates more threats with no increase in false-positive harm) and Semi (core folders) on servers and Tier-0 device groups where you want eyes on anything touching system32. Track and approve pending actions in the Action center (security.microsoft.com -> Actions & submissions -> Action center), which is also where you bulk-undo if a custom detection or AIR over-reaches.

7. A reusable hunting library mapped to MITRE ATT&CK

Detection engineering scales only if hunts are version-controlled, tagged to technique, and reviewable — not pasted into the portal and forgotten. Keep them in Git as .kql files with a metadata header, and map every one to ATT&CK so you can reason about coverage instead of counting rules.

detections/
  T1110.003-password-spray-identitylogon.kql
  T1566.001-phish-attachment-exec.kql
  T1486-ransomware-mass-file-rename.kql
  T1098.005-oauth-consent-grant-abuse.kql

A lightweight, machine-readable header on each file:

// name: Password spray against on-prem AD
// mitre: T1110.003
// tactic: CredentialAccess
// severity: Medium
// frequency: 1h
// entities: AccountSid, IPAddress
// version: 3
IdentityLogonEvents
| where Timestamp > ago(1h)
| where ActionType == "LogonFailed"
| summarize FailedAccounts = dcount(AccountUpn),
            Accounts = make_set(AccountUpn, 25)
          by IPAddress, bin(Timestamp, 1h)
| where FailedAccounts >= 10        // one source, many accounts == spray
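For comparison, here is a sketch of how a second library entry might look with the same header convention, this time for the T1486 file in the listing above. The thresholds (500 renames across 20 folders in 5 minutes) are hypothetical starting points that would need tuning against your own baseline, not tested values.

```kql
// name: Mass file rename by a single process (possible ransomware)
// mitre: T1486
// tactic: Impact
// severity: High
// frequency: 1h
// entities: DeviceId
// version: 1
DeviceFileEvents
| where Timestamp > ago(1h)
| where ActionType == "FileRenamed"
| summarize (Timestamp, ReportId) = arg_max(Timestamp, ReportId),
            Renames = count(),
            Folders = dcount(FolderPath)
          by DeviceId, DeviceName, InitiatingProcessFileName, Window = bin(Timestamp, 5m)
| where Renames >= 500 and Folders >= 20   // hypothetical thresholds: one process, many files, many folders
```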

Sync the library with the Defender XDR custom detection API (under the microsoft.graph.security endpoints) so a CI pipeline is the source of truth and the portal is just the runtime — review in PRs, deploy on merge.

Verify

Confirm each layer is actually live before you trust it.

EmailEvents
| where Timestamp > ago(1h)
| summarize EventCount = count() by bin(Timestamp, 10m)
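The same idea extends across workloads. A small sketch that reports the newest event and the ingestion lag per table, so a stalled connector stands out; the table list and one-day window are assumptions you would adjust to the workloads you have deployed.

```kql
// Sketch: freshness check per workload table.
union withsource=TableName DeviceProcessEvents, IdentityLogonEvents, EmailEvents, CloudAppEvents
| where Timestamp > ago(1d)
| summarize LastEvent = max(Timestamp), Events = count() by TableName
| extend Lag = now() - LastEvent
| sort by Lag desc
```

A table missing from the output has sent nothing in the last day, which usually means the workload is not connected.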

Enterprise scenario

A retail platform team running Microsoft 365 E5 across ~14,000 endpoints turned on a custom detection for AiTM session-cookie reuse — sign-in from a new ASN immediately followed by a high-value mailbox rule creation — and wired it to Disable user with Full automation on every device group. The detection was sound. The blast radius was not: on the second night it fired on a shared finance service mailbox during a legitimate quarter-close batch run from a new datacenter egress IP, disabled the account, and broke an automated payments reconciliation job feeding SAP. The on-call SOC analyst could see the disrupted incident but did not have rights to re-enable the identity, so recovery waited on an identity admin escalation — about 40 minutes of failed jobs.

The constraint: they needed aggressive auto-response for real users but could not let it touch a small set of Tier-0 service identities or shared mailboxes. The fix was two-layered. First, they scoped the custom detection to exclude service accounts by filtering on an Entra group, so the rule never raised an actionable alert for those identities. Second, they added those same accounts to the automated response exclusions under automatic attack disruption, so even Microsoft’s built-in BEC disruption would not disable them — defense in depth against both their logic and Microsoft’s.

// Exclude Tier-0 / service identities by Entra group membership before alerting
let excluded = IdentityInfo
    | where Timestamp > ago(1d)
    | where GroupMembership has "SG-NoAutoContain"     // managed Entra group
    | distinct AccountUpn;
IdentityLogonEvents
| where Timestamp > ago(1h)
| where Protocol == "OAuth2" and isnotempty(IPAddress)
| where AccountUpn !in~ (excluded)                     // never auto-disable these
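| summarize ASNs = dcount(ISP),                         // ISP is used here as a proxy for ASN
            arg_min(Timestamp, IPAddress)
          by AccountUpn, Hour = bin(Timestamp, 1h)     // named bin avoids clashing with arg_min's Timestamp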
| where ASNs >= 2
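An exclusion that silently resolves to nothing protects no one, so it is worth checking what the group filter actually returns. A minimal sketch, reusing the same SG-NoAutoContain group name and GroupMembership filter as the query above (both carried over from that example, not verified against your tenant):

```kql
// Sketch: list the identities the exclusion currently covers.
IdentityInfo
| where Timestamp > ago(1d)
| where GroupMembership has "SG-NoAutoContain"
| summarize arg_max(Timestamp, AccountDisplayName) by AccountUpn   // latest record per account
| project AccountUpn, AccountDisplayName, LastSeen = Timestamp
| sort by AccountUpn asc
```

An empty result means the rule would run with no exclusions at all, which is the failure mode from the first night.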

The lesson the team baked into their standard: automated response and exclusion lists are the same design decision. Every rule that can disable a user or isolate a device ships with an explicit, Entra-group-driven exclusion of Tier-0 assets, reviewed in the same PR as the detection logic. Aggressive automation is safe only when its boundaries are as deliberate as its triggers.

Checklist
