Flow logs are the cheapest, most underused telemetry in any cloud network. Every NSG, every VPC, every subnet is already capable of emitting a record for each connection that traverses it, and most teams either leave them off or dump them into a storage account nobody ever queries. That is a waste, because flow logs answer the questions security and platform teams ask constantly: who is talking to whom, what got denied, which workload suddenly started shipping gigabytes to an IP in a country you do not operate in. The raw logs are nearly useless on their own — they are NDJSON or space-delimited text at the scale of millions of records an hour. The value is in the pipeline: enrichment, partitioning, and the queries that turn a 5-tuple into a finding. This guide builds that pipeline on both Azure and AWS, then writes the KQL and SQL that actually surfaces problems.
1. Flow log anatomy: the 5-tuple, verdicts, and what is missing
Every flow log record is built around the 5-tuple: source IP, destination IP, source port, destination port, and protocol. To that, each platform bolts on a verdict (allowed or denied) and traffic counters.
AWS VPC Flow Logs in the default format are space-delimited fields. The ones that matter:
| Field | Meaning |
|---|---|
srcaddr / dstaddr |
Source and destination IP |
srcport / dstport |
Source and destination port |
protocol |
IANA protocol number (6 = TCP, 17 = UDP) |
action |
ACCEPT or REJECT |
bytes / packets |
Counters for the aggregation window |
log-status |
OK, NODATA, or SKIPDATA |
Azure VNet flow logs (the successor to NSG flow logs, which retired in mid-2025) emit JSON. Each flowTuple is a comma-delimited string inside the record: timestamp, source IP, dest IP, source port, dest port, protocol (T/U), direction (I/O), flow state (B begin / C continue / E end), and byte/packet counts in each direction.
What is missing from both is the part that hurts. There is no process, no user, no DNS name, no TLS SNI — flow logs are pure L3/L4. They tell you 10.2.1.40 talked to 52.x.x.x:443 and moved 4 GB; they cannot tell you it was curl exfiltrating to an S3 bucket. They are also sampled and aggregated: a single record summarizes many packets over an interval, so you lose per-packet timing. And REJECT records only appear when a rule actually denies traffic — asymmetric routing or an upstream firewall drop may produce no flow record at all. Knowing these limits keeps you from over-trusting the data.
2. Enabling flow logs with the right format and interval
Aggregation interval is the single most consequential setting. AWS offers 10-minute or 1-minute windows. One minute roughly 10x’s your record volume but is the only way to catch short-lived bursts and to make timing-based detections meaningful. For a security pipeline, use 1-minute; for pure cost accounting, 10 is fine.
Enable VPC Flow Logs to S3 with a custom format so you capture the fields detections need (default format omits tcp-flags, flow-direction, and pkt-dstaddr, which is the real destination behind a NAT/load balancer):
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-0abc123 \
--traffic-type ALL \
--log-destination-type s3 \
--log-destination arn:aws:s3:::netflow-logs-prod/vpc/ \
--max-aggregation-interval 60 \
--destination-options FileFormat=parquet,HiveCompatiblePartitions=true,PerHourPartition=true \
--log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction} ${pkt-srcaddr} ${pkt-dstaddr} ${tcp-flags}'
Two flags earn their place here: FileFormat=parquet cuts Athena scan cost dramatically versus plain text, and HiveCompatiblePartitions=true makes partition discovery automatic.
On Azure, VNet flow logs are configured through Network Watcher and require Traffic Analytics to be useful (raw logs land in a storage account; analytics sends enriched data to Log Analytics). With Bicep:
resource flowLog 'Microsoft.Network/networkWatchers/flowLogs@2024-05-01' = {
parent: networkWatcher
name: 'fl-vnet-prod'
location: location
properties: {
targetResourceId: vnet.id // VNet-level flow log
storageId: flowStorage.id
enabled: true
format: {
type: 'JSON'
version: 2
}
retentionPolicy: {
days: 7 // raw retention in storage; analytics is separate
enabled: true
}
flowAnalyticsConfiguration: {
networkWatcherFlowAnalyticsConfiguration: {
enabled: true
workspaceResourceId: logAnalytics.id
workspaceRegion: location
trafficAnalyticsInterval: 10 // minutes; 10 or 60 are the supported values
}
}
}
}
Target the VNet, not individual NSGs — VNet flow logs follow the resource even as NSGs are reassigned and survive the NSG-flow-log deprecation.
3. Storage and cost: sampling, lifecycle, and partitioning
Flow logs grow without bound. A busy VPC at 1-minute intervals can emit tens of GB per day. Three levers keep this sane.
Partitioning. Never let Athena scan the whole bucket. With HiveCompatiblePartitions=true, logs land at paths like .../year=2026/month=06/day=08/hour=14/. Create the table as partitioned and use partition projection so you never run MSCK REPAIR TABLE:
CREATE EXTERNAL TABLE vpc_flow_logs (
version int, account_id string, interface_id string,
srcaddr string, dstaddr string, srcport int, dstport int,
protocol bigint, packets bigint, bytes bigint,
start_ts bigint, end_ts bigint, action string, log_status string,
flow_direction string, pkt_srcaddr string, pkt_dstaddr string, tcp_flags int
)
PARTITIONED BY (`region` string, `day` string)
STORED AS PARQUET
LOCATION 's3://netflow-logs-prod/vpc/AWSLogs/'
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.region.type' = 'enum',
'projection.region.values' = 'us-east-1,eu-west-1',
'projection.day.type' = 'date',
'projection.day.range' = '2026/01/01,NOW',
'projection.day.format' = 'yyyy/MM/dd',
'storage.location.template' =
's3://netflow-logs-prod/vpc/AWSLogs/${region}/${day}'
);
Lifecycle. Keep raw logs hot for 30 days, then transition to a cheaper tier and expire. On S3:
{
"Rules": [{
"ID": "flowlog-tiering",
"Filter": { "Prefix": "vpc/" },
"Status": "Enabled",
"Transitions": [{ "Days": 30, "StorageClass": "GLACIER_IR" }],
"Expiration": { "Days": 365 }
}]
}
Sampling. AWS does not offer flow-log sampling, so control volume with interval and by scoping which VPCs log ALL versus REJECT only. On Azure, the Traffic Analytics ingestion to Log Analytics is the cost driver; set workspace retention deliberately and rely on the 10-minute processing interval rather than ingesting every raw tuple into Log Analytics. Raw storage retention (the retentionPolicy.days above) is independent and cheap — keep it short.
4. Azure Traffic Analytics: enrichment and the Log Analytics schema
This is where Azure pulls ahead. Raw VNet flow logs are just tuples in blob storage. Traffic Analytics processes them on the interval you set and writes enriched rows into the NTANetAnalytics table in Log Analytics (the modern table; the older AzureNetworkAnalytics_CL custom log is being retired alongside NSG flow logs). Enrichment adds what the raw log lacks: geo-IP on public addresses, the FlowType (IntraVNet, InterVNet, S2S, ExternalPublic, MaliciousFlow), the NSG and rule that made the decision, and VM/subnet resolution.
Key columns in NTANetAnalytics:
| Column | Use |
|---|---|
SrcIp / DestIp |
Endpoints |
DestPort |
Destination port |
FlowType |
Classifies the flow (intra/inter-VNet, external, malicious) |
FlowStatus |
Allowed or Denied |
AclRule / AclGroup |
The NSG rule and NSG that decided |
BytesSrcToDest / BytesDestToSrc |
Directional byte counts |
Country / Region |
Geo enrichment on the public endpoint |
L7Protocol |
Best-effort app protocol |
The malicious-flow enrichment cross-references known-bad IPs, so a non-empty MaliciousFlow or a FlowType == "MaliciousFlow" is a high-signal starting point.
5. Querying: denied flows, top talkers, and unexpected egress
The three workhorse queries. First, KQL against NTANetAnalytics.
Denied flows grouped by rule (find which NSG rule is doing the dropping, and what is hitting it):
NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowStatus == "Denied"
| summarize Flows = count(), Bytes = sum(BytesSrcToDest + BytesDestToSrc)
by AclGroup, AclRule, DestPort, SrcIp
| top 50 by Flows desc
Top talkers by bytes, resolving the noisiest source/destination pairs:
NTANetAnalytics
| where TimeGenerated > ago(24h)
| where FlowStatus == "Allowed"
| extend TotalBytes = BytesSrcToDest + BytesDestToSrc
| summarize Bytes = sum(TotalBytes), Flows = count()
by SrcIp, DestIp, DestPort
| top 25 by Bytes desc
Unexpected outbound destinations — egress to public IPs in countries you do not operate in:
NTANetAnalytics
| where TimeGenerated > ago(24h)
| where FlowType == "ExternalPublic" and FlowStatus == "Allowed"
| where Country !in ("United States", "Ireland", "") // your operating geos
| summarize Bytes = sum(BytesSrcToDest), Flows = count()
by SrcIp, DestIp, Country, DestPort
| where Bytes > 10000000 // > ~10 MB
| order by Bytes desc
The same three on AWS with Athena. Note protocol is the IANA number and bytes is per-record, so we sum.
Denied flows, top offenders:
SELECT srcaddr, dstaddr, dstport, COUNT(*) AS rejects,
SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
AND action = 'REJECT'
GROUP BY srcaddr, dstaddr, dstport
ORDER BY rejects DESC
LIMIT 50;
Top talkers, using pkt_dstaddr to see the real destination behind a NAT gateway rather than the NAT’s own address:
SELECT srcaddr, pkt_dstaddr AS real_dst, dstport,
SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
AND action = 'ACCEPT'
AND flow_direction = 'egress'
GROUP BY srcaddr, pkt_dstaddr, dstport
ORDER BY total_bytes DESC
LIMIT 25;
Egress to non-RFC1918 destinations on odd ports — the shape of a beacon or exfil channel:
SELECT srcaddr, pkt_dstaddr, dstport,
SUM(bytes) AS bytes, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
AND action = 'ACCEPT' AND flow_direction = 'egress'
AND NOT regexp_like(pkt_dstaddr,
'^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
AND dstport NOT IN (443, 80) -- the long tail past normal web egress
GROUP BY srcaddr, pkt_dstaddr, dstport
HAVING SUM(bytes) > 5000000
ORDER BY bytes DESC;
6. Detections: NSG misses, exfil-shaped egress, and east-west surprises
Three detection patterns worth codifying.
NSG/security-group rule misses. A rule you think is blocking traffic but is not. If you expect no host to reach the metadata service or an admin port from the workload tier, a single ACCEPT is a finding:
-- Any ACCEPT to a port that policy says should never be reachable
SELECT srcaddr, pkt_dstaddr, dstport, SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
AND action = 'ACCEPT'
AND dstport IN (22, 3389, 2375, 6443) -- SSH, RDP, docker, kube-api
GROUP BY srcaddr, pkt_dstaddr, dstport;
Exfil-shaped egress. Exfiltration looks like sustained, asymmetric outbound volume to one external destination. The tell is the byte ratio: far more bytes leaving than arriving. In KQL:
NTANetAnalytics
| where TimeGenerated > ago(6h)
| where FlowType == "ExternalPublic" and FlowStatus == "Allowed"
| summarize Out = sum(BytesSrcToDest), In = sum(BytesDestToSrc)
by SrcIp, DestIp
| extend Ratio = todouble(Out) / (In + 1)
| where Out > 50000000 and Ratio > 20 // >50MB out, 20:1 outbound skew
| order by Out desc
East-west surprises. A workload talking laterally to a tier it has no business reaching — a frontend connecting straight to a database subnet, bypassing the app tier. With FlowType == "IntraVNet" or "InterVNet" you can baseline expected tier-to-tier pairs and alert on any new one. The same idea in Athena keys off your subnet CIDRs:
-- Frontend subnet (10.1.1.0/24) reaching DB subnet (10.1.3.0/24) directly
SELECT srcaddr, dstaddr, dstport, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
AND action = 'ACCEPT'
AND regexp_like(srcaddr, '^10\.1\.1\.')
AND regexp_like(dstaddr, '^10\.1\.3\.')
GROUP BY srcaddr, dstaddr, dstport;
7. Dashboards and alerting without alert fatigue
The trap is alerting on raw flow counts. Flow volume is noisy and self-similar; a threshold that catches a real incident on Tuesday fires twenty times during Monday’s batch job. Alert on derived, baselined metrics instead.
On Azure, schedule the detection KQL as a Log Analytics alert rule and require both a volume floor and a deviation from baseline, so normal traffic never trips it:
let lookback = 7d;
let baseline = toscalar(
NTANetAnalytics
| where TimeGenerated between (ago(lookback) .. ago(1d))
| where FlowType == "ExternalPublic"
| summarize avg(BytesSrcToDest)
);
NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowType == "ExternalPublic"
| summarize Out = sum(BytesSrcToDest) by SrcIp
| where Out > baseline * 10 and Out > 100000000 // 10x baseline AND >100MB
Practical rules to keep noise down:
- Alert on denied-flow rate per rule, not per flow; a spike in REJECTs against a specific rule means scanning or a misconfigured client.
- Alert on new external destinations (a destination IP not seen in the prior 7 days), not on volume to known ones.
- Make geo and exfil detections stateful — suppress a destination once triaged, so the same benign CDN does not re-page.
- Build the dashboard (Azure Workbook or a Grafana panel over Athena) around top talkers, denied-by-rule, and external-egress-by-country. Those three panels answer 90% of investigations.
Verify
Confirm the pipeline before you trust a detection. End to end:
# 1. AWS: confirm flow logs are active and healthy (not in a failed state)
aws ec2 describe-flow-logs \
--filter "Name=resource-id,Values=vpc-0abc123" \
--query 'FlowLogs[].{Status:FlowLogStatus,Dest:LogDestination,Fmt:LogFormat}'
# 2. Generate a known flow you can find later (a denied egress)
# then confirm REJECT records appear in Athena for that src IP.
// 3. Azure: confirm Traffic Analytics is writing enriched rows
NTANetAnalytics
| where TimeGenerated > ago(2h)
| summarize Records = count(), Denied = countif(FlowStatus == "Denied")
by FlowType
| order by Records desc
A healthy result has records across multiple FlowType values and a non-zero count. If NTANetAnalytics is empty but the storage account has blobs, Traffic Analytics is disabled or pointed at the wrong workspace. If Athena returns rows but partitions look stale, your projection range or LOCATION is wrong. Generate a deliberate denied flow (curl a blocked port) and confirm it surfaces in the denied-flows query within one processing interval — that proves the loop works.
Enterprise scenario
A fintech platform team ran ~80 VPCs across three AWS accounts behind a centralized inspection VPC. Their constraint: a compliance requirement to prove no production workload egressed to non-approved countries, plus an Athena bill that had crept to four figures a month because every analyst was running SELECT * ... WHERE against an unpartitioned text-format flow-log table that scanned the entire bucket on every query.
Two changes fixed both. First, they re-created the flow logs in Parquet with Hive-compatible hourly partitions and rebuilt the Athena table with partition projection scoped to region and day — the same pattern in section 3. A query that previously scanned 400 GB now scanned a single day’s partition, dropping per-query cost by over 95% and Athena spend back under control. Second, they codified the country-egress check as a scheduled query writing results to a compliance bucket, with the destination IPs run through a geo lookup. The detection that mattered fired within a week: a batch job in a staging account had a NAT route that let it reach a package mirror hosted in a region outside their approved list. Flow logs caught it because the egress showed up as ExternalPublic to an unexpected country code, and pkt_dstaddr revealed the real mirror behind the NAT.
The core remediation snippet — the scheduled compliance query, keyed off the approved-CIDR exclusion they maintained as a list:
-- Daily compliance: egress to any public dst outside approved ranges
SELECT account_id, srcaddr, pkt_dstaddr, dstport,
SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day = date_format(current_date - interval '1' day, '%Y/%m/%d')
AND action = 'ACCEPT' AND flow_direction = 'egress'
AND NOT regexp_like(pkt_dstaddr,
'^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|100\.6[4-9]\.|100\.[7-9][0-9]\.|100\.1[01][0-9]\.|100\.12[0-7]\.)')
GROUP BY account_id, srcaddr, pkt_dstaddr, dstport
HAVING SUM(bytes) > 1000000
ORDER BY bytes DESC;
The RFC1918-plus-CGNAT exclusion regex matters: it keeps internal and carrier-grade NAT ranges out of the “external egress” result so the report only shows genuinely public destinations.
Closing the loop: feeding findings back into rules
A finding that does not change a rule is just a log line. The pipeline pays off when each detection feeds a security-group or NSG change:
- A denied-flow spike against a rule that turns out to be legitimate traffic means the rule is too tight — widen it, with the flow log as the justification in the change ticket.
- An unexpected ACCEPT to an admin port means a rule is missing or too broad — add an explicit deny and re-run the detection to confirm it now shows REJECT.
- A new external destination that is legitimate (a new SaaS dependency) gets added to an allowlist; one that is not becomes a blocked CIDR and an incident.
Keep the rule changes in the same IaC repo as the flow-log config so the detection and the control live together. The denied-flow query becomes your regression test: deploy the NSG/SG change, wait one processing interval, and confirm the flow that used to ACCEPT now shows REJECT (or vice versa). That closed loop — log, query, detect, change rule, re-verify with the same query — is what turns flow logs from passive storage into an active control.