KQL Threat Hunting in Microsoft Defender: Full Guide

Most KQL tutorials hand you a pile of operators and leave you to work out how they fit together. This guide does the opposite. It walks the whole arc of KQL threat hunting in Microsoft Defender XDR as one continuous workflow — filtering a huge event table down to what matters, aggregating it into a signal, parsing the fields buried inside command lines, correlating separate telemetry streams, and finally shipping the result as a custom detection rule that runs on its own. By the end you can take a blank Advanced Hunting query editor and turn it into an automated detection.

Key Takeaways

  • KQL threat hunting in Microsoft Defender follows a repeatable pipeline: filter, aggregate, parse, join, then promote the query to a detection rule.
  • Filtering on Timestamp first and using token-aware operators like has is the difference between a query that returns in a second and one that times out.
  • The summarize operator — with count, dcount, bin, and arg_max — is where raw events become detections you can threshold.
  • Joins reconstruct an attack chain across the device, network, and identity tables, but the innerunique default and unfiltered joins are the two things that catch people out.
  • A hunting query is not a detection rule until it returns Timestamp and ReportId correctly and stops filtering on time — the last section covers exactly that.

Environment

  • Microsoft Defender XDR portal at security.microsoft.com, under Hunting → Advanced hunting.
  • At least one workload onboarded — Defender for Endpoint, Defender for Identity, or Defender for Office 365 — to populate the tables the examples use.
  • A role that can run queries, such as Security Reader or Security Operator. Creating the detection rule at the end needs a manage role like Security Administrator.
  • Advanced Hunting retains roughly 30 days of data, so every interactive time filter in this guide lives inside that window.

The Problem — From One-Off Hunts to a Standing Detection

A table like DeviceProcessEvents records every process creation across every onboarded device. On a fleet of any size that is millions of rows a day. Run a bare table name with no filter and you get a truncated, meaningless sample and a slow query. Worse, even once you can filter, an interactive hunt only finds an attack while you are sitting there running queries — and threats do not keep office hours.

So there are really two problems to solve, and they are the bookends of this guide. The first is shaping raw telemetry into a precise question: cut the volume by time and condition, count and group what is left, lift values out of messy strings, and correlate across tables. The second is taking the query that reliably answers that question and handing it to Defender to run around the clock. Everything between those two points is the five-stage pipeline below, and I have written it so each stage feeds the next using one running example — hunting suspicious PowerShell and credential attacks — rather than five disconnected demos.

KQL Threat Hunting Step 1 — Filtering with where and project

Every useful hunt starts by throwing away the 99.9% of events you do not care about. Two operators do that job: where removes rows, and project removes columns. Almost every query you ever write uses both.

Start with a table and a time filter

A KQL query begins with a table name, and data flows downward through a pipeline of operators separated by the pipe character. The single most important filter is on time, because it is the cheapest way to discard the most rows:

DeviceProcessEvents
| where Timestamp > ago(24h)

ago(24h) means "24 hours before now"; use m, h, and d for minutes, hours, and days. Put the time filter first, every time. Everything downstream now operates on a day of data instead of a month, which is the difference between a query that returns instantly and one that crawls.

Add narrowing predicates and pick string operators deliberately

With the window set, stack where clauses to narrow by condition. A practical, defensive example: PowerShell launched with an encoded command, a common obfuscation trick used to hide malicious scripts.

DeviceProcessEvents
| where Timestamp > ago(24h)
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine has "-encodedcommand"

in~ matches a list case-insensitively, which is what you want for file names. has matches a whole token rather than a substring, so it finds -encodedcommand as a word — more accurate and far faster than contains, because has can use the table index. The broader rule for KQL string operators:

  • == — exact, case-sensitive. Fastest. Use =~ for an exact case-insensitive match.
  • has — matches a full term, case-insensitive. The default choice for searching command lines, URLs, and paths.
  • contains — matches any substring. Flexible but slow on big tables; use sparingly.
  • startswith / endswith — anchored matches, useful for extensions or path prefixes.
  • in / in~ — membership in a list, case-sensitive and case-insensitive.

Prefer the most specific operator that still matches what you mean. A query full of contains will work on a small time window and fall over on a large one. The logging side of the encoded-PowerShell story is its own topic, which I covered in PowerShell script block logging with Event ID 4104.

Trim columns, sort, and sample

By default a query returns every column in the table. project selects just the ones you want, in the order you list them. Then order the output so interesting rows sit on top, and cap the count while exploring:

DeviceProcessEvents
| where Timestamp > ago(24h)
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine has "-encodedcommand"
| project Timestamp, DeviceName, AccountName, ProcessCommandLine, InitiatingProcessFileName
| sort by Timestamp desc
| take 100

project-away drops named columns while keeping the rest, and project-rename renames one — but a plain project of five or six columns covers most hunts. One gotcha: take returns an arbitrary, non-deterministic sample, not "the top 100." When you want ranked results, sort by first and then take or top.

Step 2 — Aggregating Events with summarize

Filtering gets you the right events; aggregation turns those events into a signal. One failed sign-in is noise — five hundred from one source in ten minutes is an attack. The operator that makes that leap is summarize, and it is the one most detections are built around.

Count rows, then count distinct values

The simplest aggregation counts events grouped by a column. Here is a classic password-spray shape — failed device logons grouped by account:

DeviceLogonEvents
| where Timestamp > ago(24h)
| where ActionType == "LogonFailed"
| summarize FailedCount = count(), DistinctDevices = dcount(DeviceName) by AccountName
| where DistinctDevices > 10
| sort by DistinctDevices desc

The by clause sets the grouping key, count() tallies rows in each group, and dcount() counts distinct values. The distinction matters: an account failing against one device many times looks different from an account failing across fifty — the second is a spray. Note the where sits after the summarize, filtering on the aggregated column. dcount() returns a fast estimate by default; use count_distinct() if you need an exact figure on a smaller set. This same count-and-group pattern, applied to sign-in logs, is the backbone of Entra ID password spray detection with sign-in logs.

Bucket time with bin, and keep detail with arg_max

To see a pattern over time rather than a single total, group by a time bucket. bin() rounds each timestamp down to a fixed interval, turning events into a time series that exposes bursts. And because aggregation normally discards everything except the group key, two functions let you carry detail through: make_set() collects distinct values into a list, and arg_max() returns columns from the row with the highest value of an expression.

DeviceLogonEvents
| where Timestamp > ago(24h)
| where ActionType == "LogonFailed"
| summarize FailedCount = count(),
            DistinctDevices = dcount(DeviceName),
            DevicesHit = make_set(DeviceName, 50),
            (LastSeen, LastDevice) = arg_max(Timestamp, DeviceName)
    by AccountName
| where DistinctDevices >= 10 and FailedCount >= 25
| sort by DistinctDevices desc

make_set(DeviceName, 50) attaches the actual list of devices each account hit, capped at 50, instead of just a number. The arg_max trick is more than a convenience — it is exactly how you preserve the Timestamp a detection rule requires when you aggregate, which the final section depends on. Threshold-tuning like the numbers above is the whole art of detection engineering, a recurring theme in the Sigma rules for SIEM detection work on this blog.

Step 3 — Parsing Command Lines with parse, extract, and split

Half the useful information in security telemetry is buried inside a single string — a command line, a URL, a file path. You cannot group by "the DLL" or "the script argument" because those are not columns; they are substrings. KQL gives you three tools to lift them out, each suited to a different shape of string.

extract for one patterned value, parse for a predictable layout

When you want one specific value, extract takes a regular expression with a capture group and returns what the group matched. Here, pulling the encoded payload that follows -encodedcommand:

DeviceProcessEvents
| where Timestamp > ago(24h)
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine has "-encodedcommand"
| extend EncodedPayload = extract(@"-[eE]ncodedCommand\s+([A-Za-z0-9+/=]+)", 1, ProcessCommandLine)
| project Timestamp, DeviceName, AccountName, EncodedPayload, ProcessCommandLine

The second argument 1 is the capture group to return, and extend adds the result as a new column. The @"..." prefix is a verbatim string literal — it stops KQL treating backslashes as escapes, so your regex stays readable. When a string follows a consistent left-to-right layout and you want several pieces at once, parse is cleaner than stacking extracts:

DeviceProcessEvents
| where Timestamp > ago(24h)
| where FileName =~ "rundll32.exe"
| parse ProcessCommandLine with * "rundll32.exe" DllAndFunction
| project Timestamp, DeviceName, DllAndFunction, ProcessCommandLine

The * matches and discards anything before the literal, and whatever follows is captured. parse is best on structured, machine-generated text where the markers are reliable; when the layout is not guaranteed, fall back to extract, which fails gracefully by returning an empty string.

split for delimited strings, then hunt on the result

For delimited strings — paths, CSV, colon-separated fields — split turns the string into an array you can index into. The payoff of all three is the same: once a value is a column, every operator you already know works on it. Here, parsing the loaded item out of rundll32.exe calls and grouping to surface the rare ones, a common shape for catching proxy execution:

DeviceProcessEvents
| where Timestamp > ago(7d)
| where FileName =~ "rundll32.exe"
| extend LoadedItem = extract(@"rundll32(?:\.exe)?\s+(.+)", 1, ProcessCommandLine)
| where isnotempty(LoadedItem)
| summarize Count = count(), Devices = dcount(DeviceName) by LoadedItem
| sort by Count asc

Sorting ascending puts the rarest invocations on top, which is often where the interesting activity hides — the item that ran on exactly one device once. The where isnotempty(...) drops rows where extraction produced nothing. Hunting rundll32 abuse is catalogued under MITRE ATT&CK T1218.011, Rundll32, and parsing is what lets you pivot on the part that matters.

Step 4 — Correlating Tables with join and let

A single table tells one kind of story. The interesting questions live in the gaps between tables: which process made that network connection, on a device where that account just signed in. join does the correlation; let keeps the resulting queries readable.

Know the join kinds, and filter both sides first

A join combines two tables by matching a shared column. The kinds that matter for hunting:

  • innerunique — the default. Matches on the key but first deduplicates keys on the left table. This catches people out because it is not the SQL inner join they expect.
  • inner — a standard inner join: every matching pair, no deduplication.
  • leftouter — every left row, with right columns where they match and nulls where they do not. Good for enriching events with optional context.
  • leftanti — left rows with no match on the right. Excellent for "happened here but not there," like first-seen detections.

Always state the kind explicitly with kind=. Here is the foundational pattern — connecting outbound HTTPS connections to the PowerShell process that made them, with both sides filtered before they meet:

DeviceNetworkEvents
| where Timestamp > ago(24h)
| where RemotePort == 443 and ActionType == "ConnectionSuccess"
| join kind=inner (
    DeviceProcessEvents
    | where Timestamp > ago(24h)
    | where FileName in~ ("powershell.exe", "pwsh.exe")
) on DeviceId, InitiatingProcessId
| project Timestamp, DeviceName, InitiatingProcessFileName, RemoteIP, RemoteUrl, ProcessCommandLine

The biggest performance mistake with joins is correlating two large, unfiltered tables and filtering afterward. Joins are expensive, so push every filter as early as it can go on both sides, and put the smaller, more selective table on the left — KQL optimizes around the left side.

Tidy hunts with let, and find what is missing with leftanti

As queries grow, let statements keep them legible by naming intermediate pieces — a scalar, a list, or an entire subquery — up front. Some of the best detections are about absence, and leftanti expresses "new" or "unexpected" cleanly. Here, destinations contacted in the last day that were absent from the previous week:

let knownDestinations = DeviceNetworkEvents
    | where Timestamp between (ago(8d) .. ago(1d))
    | distinct RemoteUrl;
DeviceNetworkEvents
| where Timestamp > ago(1d)
| where isnotempty(RemoteUrl)
| join kind=leftanti knownDestinations on RemoteUrl
| summarize Count = count(), Devices = dcount(DeviceName) by RemoteUrl
| sort by Devices desc

This is a first-seen hunt, and new-and-rare is a strong signal. Naming the lookback once means you change it in one place. Correlation like this is the backbone of multi-stage detection, the same idea behind the SIEM correlation rules for real attacks on this blog.

Step 5 — Shipping a KQL Custom Detection Rule

A hunting query you run by hand finds yesterday's attack. A custom detection rule runs that same query on a schedule and alerts you while the attack is happening. But a hunting query cannot be promoted as-is — the detection engine has firm requirements about returned columns and a couple of non-obvious timing behaviours.

Return the required columns, and preserve them through summarize

A rule must be able to timestamp its alerts and identify the affected asset, so the query has to return Timestamp and ReportId — plus DeviceId for Defender for Endpoint tables, all from the same event — and a strong entity column like DeviceId or AccountObjectId. A simple unaggregated query returns these automatically; the trouble starts with summarize, which discards every column except the group keys. The fix is the arg_max trick from Step 2:

DeviceLogonEvents
| where ActionType == "LogonFailed"
| summarize (Timestamp, ReportId) = arg_max(Timestamp, ReportId),
            FailedCount = count(),
            DistinctAccounts = dcount(AccountName)
    by DeviceId
| where DistinctAccounts >= 10 and FailedCount >= 50

Grouping by DeviceId keeps the strong entity column, and arg_max(Timestamp, ReportId) carries the latest event's timestamp and report ID through the aggregation — a single host hit by failed logons from at least ten distinct accounts.

Do not over-filter on Timestamp

Notice that detection query has no ago() clause. In an interactive hunt you always lead with where Timestamp > ago(...); in a custom detection rule you generally should not. The rule's frequency sets a fixed lookback for you, and the engine evaluates events by ingestion time to account for delays. A tight Timestamp filter on top of that can make the rule miss legitimately late-arriving events. After four steps of "filter on time first," this is the one place that rule reverses — worth saying plainly.

Create the rule, set frequency, and tune for the alert cap

With the query validated, select Create detection rule, fill in the alert details and ideally the relevant MITRE ATT&CK technique, and choose a frequency — which dictates the lookback automatically:

  • Continuous (NRT) — near real-time, evaluated as events arrive. Fastest, but single-table queries only: no joins or unions.
  • Every hour — four-hour lookback.
  • Every 3 hours — twelve-hour lookback.
  • Every 12 hours — 48-hour lookback.
  • Every 24 hours — 30-day lookback.

The single-table detection above qualifies for NRT; the join-based first-seen hunt does not, and would need a scheduled frequency. A rule raises at most 150 alerts per run, so tune thresholds against your baseline before turning it on broadly, then set the device scope and optionally attach a response action such as isolating the device. The full requirements live in Microsoft's custom detection rules documentation. Once saved, the rule runs immediately against the past 30 days and then on its schedule — a query you used to run by hand, turned into a standing control that watches continuously and can act on its own.

Frequently Asked Questions

What is the difference between has and contains in KQL?

has matches a complete term or token and can use the table index, making it fast and precise. contains matches any substring, including partial words, but cannot use the index and is slow on large tables. Use has unless you specifically need a partial-substring match.

What is the difference between count and dcount in KQL?

count() returns the number of rows in each group. dcount() returns the number of distinct values of a column, as a fast estimate. Use count() for "how many events" and dcount() for "how many unique things," such as distinct devices or IPs. For an exact distinct count on a smaller set, use count_distinct().

When should I use parse instead of extract in KQL?

Use parse when a string follows a predictable left-to-right layout and you want several fields at once; it is more readable than chaining extracts. Use extract when you want a single value identified by a regular expression, especially when its position in the string varies.

What is the default join kind in KQL?

The default is innerunique, which deduplicates the keys on the left table before matching. This differs from a SQL inner join and can return fewer rows than expected. Always specify the join kind explicitly with kind= so the behaviour is unambiguous.

How do I keep Timestamp and ReportId when I use summarize?

Use arg_max(Timestamp, ReportId) inside the summarize to pull those columns from the most recent event in each group. This satisfies the column requirements of a custom detection rule while still letting you aggregate and apply thresholds.

Why shouldn't I filter on Timestamp in a custom detection rule?

The rule's frequency applies a fixed lookback automatically, and the engine evaluates events by ingestion time to handle delays. A tight Timestamp filter on top of that can exclude late-arriving events, causing missed detections. Let the frequency control the time window instead.

Do these queries work in Microsoft Sentinel too?

The KQL operators are identical, but the schema is not. Defender XDR uses tables like DeviceProcessEvents, while Sentinel's data may live in tables such as SecurityEvent. The logic ports; the table and column names may need adjusting.

Conclusion

KQL threat hunting is not five separate skills — it is one pipeline. Filtering on time and the right operators cuts the volume; summarize with counts, distinct counts, time bins, and arg_max turns what is left into a signal; parse, extract, and split unlock the fields hidden in command lines; join and let reconstruct the chain across tables; and a handful of column and timing rules promote the query to a detection that runs on its own. Almost every detection you will write is some combination of those ideas.

KQL is a deep language and there is always more — dynamic columns, advanced functions, cross-workspace queries — but with this workflow you can take a blank query editor to an automated detection rule, which is the part that actually defends the environment. Get comfortable filtering and aggregating first; the rest is mostly about feeding cleaner, richer data into the same handful of operators.

Related Posts

Editorial note: posts on this blog are drafted with AI assistance and then reviewed, edited, and tested against a real environment before publishing. Commands, output, and screenshots come from systems I actually ran the work on.

Detection Engineering KQL Microsoft 365 Security Microsoft Defender XDR Threat Hunting
SecurityScriptographer author

About the author

SecurityScriptographer is written and maintained by one person — a defender who builds and tests the detections, scripts, and Microsoft 365 workflows here before publishing them. More about me · @twi_nox

0 comments:

Post a Comment