Network Flow Logs to Insight: Building a Traffic Analytics and Detection Pipeline

Flow logs are the cheapest, most under-used telemetry any cloud network already produces. Every network security group, subnet and VPC can emit one record per connection that crosses it — and most teams either leave the feature off or dump the records into a storage account nobody queries. That is a waste, because flow logs answer the questions security and platform engineers ask constantly: who is talking to whom, what got denied, which workload suddenly started shipping gigabytes to an IP in a country you do not operate in, and whether the security-group rule you think is blocking is actually blocking. The raw record on its own is nearly useless — space-delimited text or NDJSON at millions of rows an hour, a pure L3/L4 5-tuple with no process, user or DNS name. The value is never in the record; it is in the pipeline that enriches, partitions and queries it, and in the detections that turn a 5-tuple into a finding before it becomes an incident.

This guide builds that pipeline end to end on both clouds. On Azure it covers NSG flow logs v2 and their successor VNet flow logs, and the Traffic Analytics layer that enriches raw tuples with geo-IP, flow classification and known-malicious-IP matching before writing them into Log Analytics. On AWS it covers VPC Flow Logs — default and custom record formats, delivery to S3, CloudWatch Logs and Amazon Data Firehose, and the Athena partition-projection pattern that keeps query cost sane at scale. Then it writes the detections: the KQL and Athena SQL that surface data exfiltration, port and host scanning, command-and-control (C2) beaconing, denied-flow spikes and east-west lateral movement — with a portable ruleset you can lift into your own SIEM. Because this is a reference you will return to mid-investigation, the schemas, field maps, retention tiers, cost drivers and detection thresholds are all laid out as scannable tables.

By the end you will treat flow logs not as passive storage but as an active control: which fields to capture (the default AWS format omits the three that matter most), how to partition so a query scans one day instead of the whole bucket, how Traffic Analytics classifies a MaliciousFlow, and how to gate an alert on baseline-deviation-plus-volume-floor so normal batch traffic never pages you. Most importantly you will close the loop — every detection feeds an NSG/security-group change, and the same detection query becomes the regression test that proves it worked.

What problem this solves

The pain is concrete and recurring. A workload exfiltrates over HTTPS to a bucket in a region you never operated in, and nothing notices because the traffic is encrypted and the destination is “just port 443.” A compromised host runs a slow port scan, one SYN every few seconds, under every rate-based alarm. Malware beacons to its C2 every sixty seconds with a tiny, regular heartbeat that looks like health-check noise. A misconfigured rule silently lets the app tier reach the database subnet directly, bypassing the middle tier your diagram promised. None of these throw an error or show up in an application log. All are visible in flow logs — if you collect the right fields, store them so you can query them affordably, and run detections that know what these attacks look like at the 5-tuple level.

What breaks without this: teams buy an expensive NDR appliance to see traffic they were already logging for free; compliance cannot prove no production workload egressed to a non-approved country because the flow-log table is unpartitioned plain text and a single query scans 400 GB; responders reconstructing a breach find the flow logs had seven-day retention and the relevant window aged out; and platform teams leave the feature off because “it’s just noise.” The abstraction that makes cloud networking easy — you never touch the switch — also removes the SPAN port and tap you would have used on-premises. Flow logs are the cloud replacement for that visibility; a detection pipeline is what makes them usable.

Who hits this: anyone running production in Azure or AWS who needs network-layer detection, egress-compliance evidence, or breach-reconstruction. It bites hardest on teams with many VPCs/VNets across accounts (fan-out makes ad-hoc querying impossible), teams under a data-residency or PCI-DSS requirement (you must prove egress boundaries), and anyone who turned flow logs on, pointed them at a bucket, and never built the query layer — the overwhelming majority. The fix is not a new product; it is a pipeline over telemetry you already have.

To frame the whole field before the deep dive, here is every detection class this article builds, the flow-log signal it keys off, and where you run it:

Detection class	What the attacker/misconfig does	Flow-log signal	Where you run it
Data exfiltration	Sustained outbound to an external host	High `BytesSrcToDest`, high out/in ratio, `ExternalPublic`	KQL on `NTANetAnalytics` / Athena on `bytes` + `pkt_dstaddr`
Port scan (one host, many ports)	Enumerates services on a target	One src → many `DestPort`, mostly `Denied`/`REJECT`	`dcount(DestPort)` / `COUNT(DISTINCT dstport)`
Host sweep (one port, many hosts)	Enumerates hosts running a service	One src → many `DestIp`, one port	`dcount(DestIp)` / `COUNT(DISTINCT dstaddr)`
C2 beaconing	Regular heartbeat to a controller	Periodic small flows, low jitter, same dst	Inter-arrival stddev over ordered flow times
Denied-flow spike	Scanning, misconfigured client, policy probe	Surge in `Denied`/`REJECT` per rule	`countif(FlowStatus=="Denied")` per `AclRule`
East-west surprise	Lateral movement / broken segmentation	Tier-to-tier flow that should not exist	Src CIDR → Dst CIDR pair not in baseline
Rule miss (unexpected ACCEPT)	Control that should block but doesn’t	An `Allowed`/`ACCEPT` to a forbidden port	`dstport IN (22,3389,...)` AND `action='ACCEPT'`
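
As a taste of how one of these signals is computed, the C2-beaconing row can be sketched in a few lines of Python. This is a minimal illustration, not the article's production detection: `beacon_score` and its `min_flows` parameter are hypothetical names, and the input is assumed to be the epoch-second start times of flows from one source to one destination. The score is the coefficient of variation of the gaps between flows, so a value near zero means a very regular heartbeat.

```python
from statistics import mean, pstdev

def beacon_score(flow_starts, min_flows=6):
    """Coefficient of variation of inter-arrival gaps, or None if undecidable.

    flow_starts: epoch-second start times for one (source, destination) pair.
    Lower values mean more regular, beacon-like timing.
    """
    times = sorted(flow_starts)
    if len(times) < min_flows:
        return None  # too few flows to judge periodicity
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    avg_gap = mean(gaps)
    if avg_gap == 0:
        return None  # all flows share one timestamp; no cadence to measure
    return pstdev(gaps) / avg_gap

regular = [1000 + 60 * i for i in range(10)]          # one flow every 60 s
bursty = [1000, 1003, 1100, 1800, 1805, 2600, 2610]   # irregular, human-like

print(beacon_score(regular))  # 0.0
print(beacon_score(bursty))   # well above 1
```

A real rule would also need a floor on flow count and a ceiling on bytes per flow, because health checks and NTP are just as regular as malware.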

Learning objectives

By the end of this article you can:

Read and map the Azure VNet flow log, NSG flow log v2, Traffic Analytics NTANetAnalytics, and AWS VPC Flow Log schemas field-for-field, and name what each records and — critically — what each omits.
Enable flow logs at the right scope (VNet / VPC) with the right aggregation interval and a custom AWS field set that captures pkt-dstaddr, flow-direction and tcp-flags — the three fields the default format drops that detection needs.
Explain the NSG-flow-logs → VNet-flow-logs deprecation (no new NSG flow logs can be created after 30 June 2025, and existing NSG flow logs retire on 30 September 2027), and migrate without losing coverage.
Build the AWS query layer: Parquet + Hive-compatible partitions, an Athena external table with partition projection (no MSCK REPAIR), and cost-scoped scans.
Understand Traffic Analytics enrichment — geo-IP, FlowType classification, NSG/rule resolution, and the known-malicious-IP match that produces a MaliciousFlow.
Write production KQL and Athena detections for exfil, port scans, host sweeps, C2 beaconing, denied-flow spikes, east-west lateral movement and rule misses, and gate alerts on baseline deviation plus a volume floor to kill alert fatigue.
Size retention and cost across raw storage, Traffic Analytics ingestion, CloudWatch Logs, Firehose and Athena — and pick the tiering that meets your compliance window without a runaway bill.
Close the loop: feed every finding back into an NSG/security-group change in IaC and use the detection query as its regression test.

Prerequisites & where this fits

You should already understand cloud networking fundamentals: what a VNet/VPC and subnet are, what an NSG (Azure network security group) and an AWS security group / network ACL do, the difference between stateful (security group) and stateless (NACL) filtering, and the 5-tuple (source IP, destination IP, source port, destination port, protocol). You should be comfortable in Log Analytics / KQL on the Azure side and Amazon Athena / Presto SQL on the AWS side, and able to run az, aws and either Bicep or Terraform. Familiarity with RFC 1918 private ranges, CGNAT (100.64.0.0/10), IANA protocol numbers (6 = TCP, 17 = UDP, 1 = ICMP), and basic TCP flags (SYN, ACK, FIN, RST) is assumed.

This sits in the Observability & Detection track and is downstream of your segmentation design. It assumes you have already drawn tier boundaries — the article Micro-Segmentation with NSGs and Application Security Groups: Tier Isolation at Scale is the control this pipeline verifies. It pairs tightly with KQL Threat Hunting Playbooks: MITRE ATT&CK Mapping, UEBA, and Hunting Notebooks, because the detections here are network-layer hunts you fold into that broader program, and with Engineering Incident Response: Runbooks, Tabletop Exercises, and Cloud Forensics, because flow logs are your primary artefact for reconstructing what a compromised host touched. When a detection finds outbound to a bad destination, the remediation often lands in a Centralized Internet Egress: FQDN Filtering, Explicit Proxy, and TLS Inspection chokepoint or an AWS Network Firewall in Production: Suricata Rule Engineering for Egress Inspection rule.

A quick map of who owns which layer during an investigation, so you route the finding to the right team fast:

Layer	What lives here	Who usually owns it	What flow logs prove about it
Workload / app tier	The process making connections	App / dev team	Which host initiated an unexpected flow
NSG / security group	Allow/deny decisions	Platform / network	Whether a rule fired (`Allowed`/`Denied`, `AclRule`)
Subnet / route table	Where traffic can go	Network team	East-west pairs, egress path via NAT
Egress edge (NAT / firewall / proxy)	Internet-bound traffic	Network / security	External destinations, `pkt_dstaddr` behind NAT
Log Analytics / S3+Athena	The query layer	SecOps / platform	The detection surface itself
SIEM / SOAR	Alerting + response	SOC	Where findings become tickets and blocks

Core concepts

Six mental models make every later section obvious.

A flow log record is a 5-tuple plus a verdict plus counters — and nothing else. Every record, on every platform, is built around the 5-tuple (source IP, destination IP, source port, destination port, protocol), plus a verdict (allowed/denied) and counters (bytes and packets over an aggregation window). That is the entire payload — no process name, user, DNS name, TLS SNI, or HTTP host header. Flow logs are pure L3/L4: they tell you 10.2.1.40 talked to 52.x.x.x:443 and moved 4 GB; they cannot tell you it was curl exfiltrating to a bucket. Every detection is a pattern over the 5-tuple, verdict, counters and time — and every detection has a blind spot where the payload would have told the truth.

Flow logs are aggregated and directional, not a packet capture. A single record summarises many packets over an interval (1 or 10 minutes on AWS; a configurable interval on Azure), so per-packet timing inside the window is lost. Records are directional: Azure gives bytes source-to-destination and destination-to-source separately; AWS emits separate records per direction, distinguished by flow-direction. A REJECT record appears only when a rule actually denies; an upstream firewall drop or an asymmetric return path may produce no record at all, and AWS can skip records under capacity pressure (log-status SKIPDATA). Absence of a denied flow is not proof the traffic was allowed.

Aggregation interval is the single most consequential setting. AWS offers a 10-minute or 1-minute window; the 1-minute window produces up to roughly 10× the record volume, but it is the only way to catch short-lived bursts and make any timing detection (beaconing, scan cadence) meaningful. Azure Traffic Analytics processes on a configurable interval (10 or 60 minutes). For security, choose the finer interval and pay the volume; for cost accounting, the coarser one is fine.

The default AWS format is missing exactly the fields detection needs. It omits tcp-flags (SYN-scan vs established), flow-direction (ingress vs egress), and pkt-srcaddr/pkt-dstaddr (the real source/destination behind a NAT/load balancer, versus the intermediary’s address). A custom format adding those three is the difference between a pipeline that can detect a scan and one that cannot. Azure has no equivalent “missing field” problem — VNet flow logs and Traffic Analytics emit a fixed, richer schema; the analogous choice is simply enabling Traffic Analytics rather than leaving raw tuples in blob.

Enrichment turns a tuple into a finding — and Azure does it for you. Raw Azure flow logs are just tuples in blob. Traffic Analytics enriches them: geo-IP on public addresses, flow classification (IntraVNet, InterVNet, S2S, ExternalPublic, MaliciousFlow), NSG-and-rule resolution, VM/subnet name resolution, and a cross-reference against a known-malicious-IP feed. On AWS you build each yourself — geo-IP via a lookup join in Athena, RFC1918/CGNAT classification via regex, threat-intel via a join against your list. Enrichment is the difference between “an IP” and “an IP in a country you don’t operate, on a threat list, receiving 4 GB.”
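
The AWS-side classification step can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's standard `ipaddress` module rather than the regex approach mentioned above, it handles IPv4 ranges only, and `address_scope`, `classify_flow` and the `threat_list` argument are hypothetical names standing in for whatever inventory and intel feed you maintain.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(cidr)
           for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
CGNAT = ipaddress.ip_network("100.64.0.0/10")

def address_scope(ip):
    """Bucket one address as private, cgnat, special or public."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in RFC1918):
        return "private"
    if addr in CGNAT:
        return "cgnat"
    if addr.is_loopback or addr.is_link_local or addr.is_multicast:
        return "special"
    return "public"

def classify_flow(src, dst, threat_list=frozenset()):
    """Rough analogue of a flow-type label for one src/dst pair."""
    if src in threat_list or dst in threat_list:
        return "malicious"      # an endpoint is on your own intel list
    if "public" in (address_scope(src), address_scope(dst)):
        return "external"       # at least one endpoint is on the internet
    return "internal"

print(classify_flow("10.2.1.40", "52.95.110.1"))                      # external
print(classify_flow("10.0.1.5", "192.168.1.9"))                       # internal
print(classify_flow("10.0.0.1", "203.0.113.7", {"203.0.113.7"}))      # malicious
```

In Athena the same logic becomes a join against a CIDR inventory table; the Python form is mainly useful for unit-testing the classification rules before you encode them in SQL.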

A detection that doesn’t change a rule is just a log line. The pipeline pays off only when each finding feeds a control change: a denied-flow spike on legitimate traffic means the rule is too tight (widen it); an unexpected ACCEPT to an admin port means a rule is missing (add an explicit deny); a benign new external destination gets allow-listed, a malicious one blocked. Keep rule changes in the same IaC repo as the flow-log config, and use the detection query as the regression test after every change.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to detection
5-tuple	src IP, dst IP, src port, dst port, protocol	Every record	The identity of a flow; every detection keys off it
Flow tuple (Azure)	Comma-delimited tuple string inside a VNet-flow-log record	Blob storage JSON	The raw Azure record before Traffic Analytics
NSG flow log v2	Legacy Azure flow log attached to an NSG	Network Watcher	Deprecated; migrate to VNet flow logs
VNet flow log	Modern Azure flow log attached to a VNet	Network Watcher	Survives NSG reassignment; the go-forward
Traffic Analytics	Azure processing that enriches flow logs	Log Analytics (`NTANetAnalytics`)	Adds geo, `FlowType`, rule, malicious match
`NTANetAnalytics`	The modern Traffic Analytics table	Log Analytics workspace	Where you write all Azure KQL detections
VPC Flow Log	AWS per-ENI/subnet/VPC flow record	S3 / CloudWatch / Firehose	The AWS raw record
Custom format (AWS)	A chosen field set for VPC Flow Logs	Flow-log config	Adds `tcp-flags`, `flow-direction`, `pkt-dstaddr`
Partition projection	Athena partitions computed, not cataloged	Athena table `TBLPROPERTIES`	Scoped scans without `MSCK REPAIR`
`pkt-dstaddr`	Real destination behind a NAT/LB	AWS custom field	Reveals the true external host
`FlowType`	Traffic Analytics flow classification	`NTANetAnalytics`	Filters to `ExternalPublic` / `MaliciousFlow`
Aggregation interval	Window a record summarises	Flow-log / TA config	1-min enables timing detections
Verdict	Allowed/`ACCEPT` vs Denied/`REJECT`	Every record	Denied-flow and rule-miss detections

Flow-log anatomy: the schemas, field by field

Before you can write a detection you must know exactly what each platform records. The three schemas that matter are the AWS VPC Flow Log (default and custom), the Azure VNet flow log (raw JSON in blob), and the Traffic Analytics NTANetAnalytics table (the enriched Azure surface you actually query).

AWS VPC Flow Log fields

The AWS default format is a space-delimited line of a fixed subset of fields (version 2). A custom format lets you choose from a much larger set (through version 5) and reorder them. The fields that matter for detection, and whether they are in the default set:

Field	Meaning	In default?	Why detection needs it
`version`	Flow-log format version	Yes	Parsing; v5 adds the fields below
`account-id`	Owning AWS account	Yes	Multi-account attribution
`interface-id`	ENI the flow was captured on	Yes	Which host/instance
`srcaddr` / `dstaddr`	Source / destination IP (as seen at the ENI)	Yes	The apparent 5-tuple
`srcport` / `dstport`	Source / destination port	Yes	Service identification, scan target
`protocol`	IANA protocol number (6 TCP, 17 UDP, 1 ICMP)	Yes	Protocol filtering
`packets` / `bytes`	Counters for the aggregation window	Yes	Volume, exfil sizing
`start` / `end`	Window start/end (epoch seconds)	Yes	Timing, ordering flows
`action`	`ACCEPT` or `REJECT`	Yes	Verdict — denied-flow detection
`log-status`	`OK`, `NODATA`, `SKIPDATA`	Yes	Data-quality gate
`tcp-flags`	OR of flags seen (2=SYN, 18=SYN-ACK, 1=FIN, 4=RST)	No	SYN-scan vs established distinction
`flow-direction`	`ingress` or `egress`	No	Egress-only exfil/beacon detection
`pkt-srcaddr` / `pkt-dstaddr`	Real src/dst behind NAT/LB/IGW	No	True external host, not the NAT’s IP
`traffic-path`	Egress path (IGW, VPC peering, NAT, TGW…)	No	Distinguish internet vs internal egress
`type`	Traffic type: IPv4, IPv6, EFA	No	Dual-stack coverage
`region` / `az-id`	Region / AZ of the flow	No	Multi-region attribution
`sublocation-type`	Wavelength / Outpost / Local Zone	No	Edge deployments
`pkt-src-aws-service` / `pkt-dst-aws-service`	AWS service name for the address	No	Distinguish S3/DynamoDB traffic
`flow-encryption`	Whether the flow was encrypted (VPC-level)	No	Compliance evidence
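
To make the default field order concrete, here is a minimal Python sketch that parses one space-delimited default-format line into typed fields. `FlowRecord` and `parse_default_line` are illustrative names, not part of any AWS SDK, and the sample lines use made-up account and ENI identifiers. Records whose log-status is not OK carry `-` placeholders, so the sketch returns None for them instead of trying to convert the fields.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) format, as listed in the table above.
DEFAULT_FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

@dataclass
class FlowRecord:
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    bytes: int
    start: int
    end: int
    action: str

def parse_default_line(line: str) -> Optional[FlowRecord]:
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}")
    raw = dict(zip(DEFAULT_FIELDS, parts))
    if raw["log_status"] != "OK":
        return None  # NODATA / SKIPDATA: no flow fields to convert
    return FlowRecord(
        interface_id=raw["interface_id"],
        srcaddr=raw["srcaddr"], dstaddr=raw["dstaddr"],
        srcport=int(raw["srcport"]), dstport=int(raw["dstport"]),
        protocol=int(raw["protocol"]),
        packets=int(raw["packets"]), bytes=int(raw["bytes"]),
        start=int(raw["start"]), end=int(raw["end"]),
        action=raw["action"],
    )

ok_line = ("2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 22 6 "
           "20 4249 1700000000 1700000060 ACCEPT OK")
nodata_line = "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA"

rec = parse_default_line(ok_line)
print(rec.dstport, rec.bytes, rec.action)  # 22 4249 ACCEPT
print(parse_default_line(nodata_line))     # None
```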

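A note on tcp-flags: because a flow record aggregates a window, the field is a bitwise OR of every flag seen in that window. A value of 2 (SYN only) on a record that is REJECTed, or that never gets a matching 18 (SYN-ACK) record in the reverse direction, is the tell-tale of a half-open SYN scan: the scanner never completed the handshake. Read combined values with the OR in mind: 3 (SYN + FIN) or 19 (SYN-ACK + FIN) usually means a short connection opened and closed inside one window, not an illegal SYN+FIN packet, and 18 on its own is simply the responder's half of a normal handshake. Treat those values as scan evidence only alongside other signals, such as one source fanning out across many ports or hosts.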
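
Decoding the bitmask is mechanical, and a small helper keeps detection code readable. The sketch below assumes the standard TCP flag bit positions (FIN=1, SYN=2, RST=4, PSH=8, ACK=16, URG=32); `decode_tcp_flags` and `is_syn_only` are illustrative names rather than anything AWS provides.

```python
# Standard TCP flag bit positions; the flow-log value is an OR of these.
TCP_FLAG_BITS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}

def decode_tcp_flags(value: int) -> list:
    """Names of every flag bit set in an aggregated tcp-flags value."""
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]

def is_syn_only(value: int) -> bool:
    """True when SYN was the only flag observed in the window."""
    return value == 2

print(decode_tcp_flags(2))    # ['SYN']
print(decode_tcp_flags(18))   # ['SYN', 'ACK']
print(decode_tcp_flags(19))   # ['FIN', 'SYN', 'ACK']
print(is_syn_only(2), is_syn_only(18))  # True False
```

Because the value is an OR over the whole window, the decoded list says which flags appeared at least once, not in what order or on which packet.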

Azure VNet flow log — the raw JSON record

Azure raw VNet flow logs (format JSON, version 2) land in blob storage as records containing a flowTuple string per flow. Each tuple is comma-delimited and packs the following, in order:

Position	Field	Values	Notes
1	Timestamp	Epoch milliseconds	Window start
2	Source IP	IPv4/IPv6
3	Destination IP	IPv4/IPv6
4	Source port	0–65535
5	Destination port	0–65535
6	Protocol	IANA number (`6` TCP, `17` UDP)	NSG flow logs v2 used the letters `T`/`U` here
7	Direction	`I` (inbound) / `O` (outbound)	Relative to the flow’s origin
8	Flow state	`B` begin / `C` continuing / `E` end / `D` deny	Enables session tracking across windows; `D` is the denied verdict
9	Flow encryption (v2)	`X` encrypted / `NX` not / `NA`	VNet-flow-log addition over NSG v2
10	Packets src→dst	integer	Only on `C`/`E` records
11	Bytes src→dst	integer	Only on `C`/`E` records
12	Packets dst→src	integer	Only on `C`/`E` records
13	Bytes dst→src	integer	Only on `C`/`E` records

The flow-state field is the key VNet-flow-log improvement worth understanding: B marks the first record of a flow (no byte counts yet), C marks continuation (bytes accumulate), E marks the end, and D marks a flow that was denied. Because bytes are reported on C/E records, you sum across the flow’s lifetime rather than reading a single record. Unlike NSG flow logs, VNet flow logs also record traffic without requiring the flow to hit an NSG rule: they capture at the VNet, so they see flows an NSG-attached log would miss.
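
The summing rule is easier to see in code. The following Python sketch parses the 13-position tuple from the table above and totals bytes per 5-tuple across B/C/E records. It is an illustration under assumptions: `parse_flow_tuple` and `bytes_per_flow` are hypothetical helper names, the sample tuples are invented, and the protocol token is mapped whether it arrives as a letter or as an IANA number so the helper works on either style of record.

```python
from collections import defaultdict

# Accept both letter and IANA-number protocol tokens.
PROTOCOLS = {"T": "TCP", "U": "UDP", "6": "TCP", "17": "UDP"}

def parse_flow_tuple(tuple_str: str) -> dict:
    f = tuple_str.split(",")
    if len(f) != 13:
        raise ValueError(f"expected 13 positions, got {len(f)}")
    num = lambda s: int(s) if s else 0   # counters may be empty on 'B' records
    return {
        "ts_ms": int(f[0]), "src_ip": f[1], "dst_ip": f[2],
        "src_port": int(f[3]), "dst_port": int(f[4]),
        "protocol": PROTOCOLS.get(f[5], f[5]),
        "direction": f[6], "state": f[7], "encryption": f[8],
        "pkts_out": num(f[9]), "bytes_out": num(f[10]),
        "pkts_in": num(f[11]), "bytes_in": num(f[12]),
    }

def bytes_per_flow(tuples) -> dict:
    """Sum (bytes src->dst, bytes dst->src) over every record of each flow."""
    totals = defaultdict(lambda: [0, 0])
    for t in tuples:
        r = parse_flow_tuple(t)
        key = (r["src_ip"], r["dst_ip"], r["src_port"], r["dst_port"], r["protocol"])
        totals[key][0] += r["bytes_out"]
        totals[key][1] += r["bytes_in"]
    return {k: tuple(v) for k, v in totals.items()}

sample = [
    "1663146003599,10.0.0.6,192.0.2.180,23956,443,6,O,B,NX,0,0,0,0",
    "1663146063606,10.0.0.6,192.0.2.180,23956,443,6,O,C,NX,3,767,2,1580",
    "1663146123637,10.0.0.6,192.0.2.180,23956,443,6,O,E,NX,5,1200,4,900",
]
print(bytes_per_flow(sample))
# {('10.0.0.6', '192.0.2.180', 23956, 443, 'TCP'): (1967, 2480)}
```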

The Traffic Analytics `NTANetAnalytics` schema — the enriched surface

This is the table you actually query on Azure. Traffic Analytics reads the raw tuples on its processing interval and writes enriched rows into NTANetAnalytics (the modern table; the older AzureNetworkAnalytics_CL custom log is being retired alongside NSG flow logs). Enrichment adds everything the raw log lacks. The columns you use in detections:

Column	Type	Use	Enrichment?
`TimeGenerated`	datetime	Time filter, binning	—
`SrcIp` / `DestIp`	string	Endpoints	—
`SrcPort` / `DestPort`	int	Ports; scan target	—
`L4Protocol`	string	`T`/`U`	—
`L7Protocol`	string	Best-effort app protocol	Yes
`FlowDirection`	string	`I` / `O`	—
`FlowStatus`	string	`Allowed` / `Denied`	—
`FlowType`	string	`IntraVNet`/`InterVNet`/`S2S`/`ExternalPublic`/`MaliciousFlow`/`AzurePublic`/`Unknown...`	Yes
`AclGroup`	string	The NSG that decided	Yes
`AclRule`	string	The specific rule that decided	Yes
`BytesSrcToDest` / `BytesDestToSrc`	long	Directional byte counts	—
`PacketsSrcToDest` / `PacketsDestToSrc`	long	Directional packet counts	—
`SrcPublicIps` / `DestPublicIps`	string	Public IP + flow counts packed	Yes
`Country`	string	Geo-IP country of the public endpoint	Yes
`Region`	string	Azure region of the resource	Yes
`SrcVm` / `DestVm`	string	Resolved VM name	Yes
`SrcSubnet` / `DestSubnet`	string	Resolved subnet	Yes
`NSGList` / `NSGRules`	string	NSG and rule detail	Yes
`MaliciousFlow`	string	Non-empty when matched to a bad IP	Yes
`SrcThreatIntel` / `DestThreatIntel`	string	Threat-intel match detail	Yes

The FlowType == "MaliciousFlow" classification and a non-empty MaliciousFlow column are the highest-signal starting points in the entire dataset — Traffic Analytics has already matched the public endpoint against Microsoft’s threat-intelligence feed. The AclGroup/AclRule pair is what makes Azure denied-flow analysis so much richer than AWS out of the box: you get the name of the rule that dropped the traffic, not just the fact that it was dropped.

The three schemas side by side

Because you will move between clouds, the field-equivalence map is worth pinning:

Concept	AWS VPC Flow Log	Azure raw VNet flow log	Traffic Analytics `NTANetAnalytics`
Source IP	`srcaddr` (`pkt-srcaddr` = real)	tuple pos 2	`SrcIp`
Destination IP	`dstaddr` (`pkt-dstaddr` = real)	tuple pos 3	`DestIp`
Dest port	`dstport`	tuple pos 5	`DestPort`
Protocol	`protocol` (IANA number)	tuple pos 6 (IANA number; `T`/`U` in NSG v2)	`L4Protocol`
Verdict	`action` (`ACCEPT`/`REJECT`)	tuple pos 8 (flow state `D` = denied)	`FlowStatus` (`Allowed`/`Denied`)
Bytes out	`bytes` (per record)	tuple pos 11	`BytesSrcToDest`
Bytes in	`bytes` (reverse record)	tuple pos 13	`BytesDestToSrc`
Direction	`flow-direction` (custom)	tuple pos 7 (`I`/`O`)	`FlowDirection`
Deciding rule	(not present)	`rule` on the enclosing flow group (not in the tuple)	`AclRule` / `AclGroup`
Geo/country	(build via lookup)	(not present)	`Country`
Threat match	(build via join)	(not present)	`MaliciousFlow`
Time	`start`/`end` (epoch s)	tuple pos 1 (epoch ms)	`TimeGenerated`

The three most important rows are “Deciding rule”, “Geo/country” and “Threat match”: Azure Traffic Analytics gives you all three as columns; on AWS the deciding rule is simply not recorded, and geo and threat matching are joins you build yourself. That is the core architectural difference between the two pipelines.
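
If you run detections across both clouds, the equivalence map translates directly into a pair of normalisers. The sketch below is one possible shape, not a prescribed schema: `from_aws`, `from_nta` and the output key names are hypothetical, the AWS input is assumed to be a dict keyed by the underscore-style column names, and the Azure input a dict keyed by `NTANetAnalytics` column names.

```python
from datetime import datetime, timezone

IANA_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}
LETTER_PROTOCOLS = {"T": "TCP", "U": "UDP"}

def from_aws(rec: dict) -> dict:
    """Map one VPC Flow Log record onto the shared field names."""
    proto = int(rec["protocol"])
    return {
        "time": datetime.fromtimestamp(int(rec["start"]), tz=timezone.utc).isoformat(),
        # Prefer the packet-level address when the custom format captured it.
        "src_ip": rec.get("pkt_srcaddr") or rec["srcaddr"],
        "dst_ip": rec.get("pkt_dstaddr") or rec["dstaddr"],
        "dst_port": int(rec["dstport"]),
        "protocol": IANA_PROTOCOLS.get(proto, str(proto)),
        "allowed": rec["action"] == "ACCEPT",
        "bytes_src_to_dst": int(rec["bytes"]),
        "rule": None,  # VPC Flow Logs do not record the deciding rule
    }

def from_nta(row: dict) -> dict:
    """Map one NTANetAnalytics row onto the shared field names."""
    return {
        "time": row["TimeGenerated"],
        "src_ip": row["SrcIp"],
        "dst_ip": row["DestIp"],
        "dst_port": int(row["DestPort"]),
        "protocol": LETTER_PROTOCOLS.get(row["L4Protocol"], row["L4Protocol"]),
        "allowed": row["FlowStatus"] == "Allowed",
        "bytes_src_to_dst": int(row["BytesSrcToDest"]),
        "rule": row.get("AclRule"),
    }

aws = from_aws({"srcaddr": "10.0.1.5", "dstaddr": "10.0.9.1", "pkt_dstaddr": "198.51.100.7",
                "dstport": "443", "protocol": "6", "action": "ACCEPT",
                "bytes": "4096", "start": "1700000000"})
azure = from_nta({"TimeGenerated": "2026-06-08T14:00:00Z", "SrcIp": "10.1.0.4",
                  "DestIp": "198.51.100.7", "DestPort": 443, "L4Protocol": "T",
                  "FlowStatus": "Denied", "BytesSrcToDest": 0, "AclRule": "DenyInternetOut"})
print(aws["dst_ip"], aws["protocol"], aws["allowed"])      # 198.51.100.7 TCP True
print(azure["protocol"], azure["allowed"], azure["rule"])  # TCP False DenyInternetOut
```

The `rule` key staying None on the AWS side is deliberate: it keeps the gap visible in downstream queries instead of hiding it behind a placeholder string.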

Enabling flow logs: scope, interval, and format

Getting the collection right is 80% of the value. Two decisions dominate: scope (what resource you attach the log to) and interval/format (how much fidelity you capture).

Azure: target the VNet, not the NSG

On Azure, flow logs are configured through Network Watcher. There are two kinds, and one is deprecated. NSG flow logs attach to an NSG and only see traffic that traverses that NSG. VNet flow logs attach to a VNet (or subnet, or NIC) and see all traffic in scope regardless of NSG assignment — they follow the resource even as NSGs are reassigned. Always target the VNet for new deployments.

With the CLI:

# VNet flow log with Traffic Analytics enabled
az network watcher flow-log create \
  --location eastus \
  --name fl-vnet-prod \
  --resource-group rg-network-prod \
  --vnet vnet-prod \
  --storage-account stflowlogsprod \
  --log-version 2 \
  --retention 7 \
  --traffic-analytics true \
  --workspace $(az monitor log-analytics workspace show \
      -g rg-obs -n law-security --query id -o tsv) \
  --interval 10          # Traffic Analytics processing interval: 10 or 60 (minutes)

With Bicep:

resource flowLog 'Microsoft.Network/networkWatchers/flowLogs@2024-05-01' = {
  parent: networkWatcher
  name: 'fl-vnet-prod'
  location: location
  properties: {
    targetResourceId: vnet.id            // VNet-level flow log (not an NSG)
    storageId: flowStorage.id
    enabled: true
    format: {
      type: 'JSON'
      version: 2
    }
    retentionPolicy: {
      days: 7                            // raw retention in storage; analytics is separate
      enabled: true
    }
    flowAnalyticsConfiguration: {
      networkWatcherFlowAnalyticsConfiguration: {
        enabled: true
        workspaceResourceId: logAnalytics.id
        workspaceRegion: location
        trafficAnalyticsInterval: 10     // minutes; supported values are 10 or 60
      }
    }
  }
}

Raw retention (retentionPolicy.days) applies only to the blobs in the storage account; it is independent of how long Traffic Analytics data is kept in Log Analytics. Keep raw retention short and cheap; set Log Analytics retention deliberately (that is the real cost driver).

AWS: VPC scope, 1-minute interval, custom format to S3

On AWS, enable at the VPC level (captures all ENIs in the VPC, including new ones) with a 1-minute aggregation interval for security use, a custom format that adds the three missing fields, and delivery to S3 in Parquet with Hive-compatible partitions:

aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::netflow-logs-prod/vpc/ \
  --max-aggregation-interval 60 \
  --destination-options FileFormat=parquet,HiveCompatiblePartitions=true,PerHourPartition=true \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction} ${pkt-srcaddr} ${pkt-dstaddr} ${tcp-flags} ${traffic-path}'

--max-aggregation-interval 60 means a 60-second window (the value is seconds; the alternative is 600 for 10 minutes). With Terraform:

resource "aws_flow_log" "vpc" {
  vpc_id                   = aws_vpc.prod.id
  traffic_type             = "ALL"
  log_destination_type     = "s3"
  log_destination          = "${aws_s3_bucket.flowlogs.arn}/vpc/"
  max_aggregation_interval = 60

  destination_options {
    file_format                = "parquet"
    hive_compatible_partitions = true
    per_hour_partition         = true
  }

  log_format = join(" ", [
    "$${version}", "$${account-id}", "$${interface-id}",
    "$${srcaddr}", "$${dstaddr}", "$${srcport}", "$${dstport}",
    "$${protocol}", "$${packets}", "$${bytes}", "$${start}", "$${end}",
    "$${action}", "$${log-status}", "$${flow-direction}",
    "$${pkt-srcaddr}", "$${pkt-dstaddr}", "$${tcp-flags}", "$${traffic-path}",
  ])
}

Two flags earn their place: FileFormat=parquet cuts Athena scan cost dramatically versus plain text (columnar + compressed), and HiveCompatiblePartitions=true makes partition discovery automatic (year=/month=/day=/hour= path structure).

The enable-decision tables

Scope choice, per platform, and what each sees:

Scope	Platform	Captures	Use when
VNet flow log	Azure	All flows in the VNet, all subnets/NICs	Default for new deployments
Subnet flow log	Azure	Flows in one subnet	Targeted collection, cost control
NIC flow log	Azure	Flows on one NIC	Investigating a single VM
NSG flow log (v2)	Azure	Flows crossing that NSG	Legacy only — migrate off
VPC flow log	AWS	All ENIs in the VPC (incl. new)	Default for new deployments
Subnet flow log	AWS	All ENIs in one subnet	Cost control, targeted subnet
ENI flow log	AWS	One network interface	Investigating one instance
Transit Gateway flow log	AWS	Traffic across the TGW	Central inspection / inter-VPC visibility

Aggregation interval trade-off:

Interval	Record volume	Timing detections	Best for	Platform
1 minute (60 s)	~10× baseline	Beaconing, scan cadence viable	Security pipeline	AWS
10 minutes (600 s)	Baseline	Coarse — bursts lost inside window	Cost accounting	AWS
10-minute TA interval	Enriched every 10 min	Near-real-time detections	Security pipeline	Azure
60-minute TA interval	Enriched hourly	Delayed; lower cost	Low-urgency / cost	Azure

Traffic-type / filter choice on AWS:

`traffic-type`	Records	Use when
`ALL`	Accepted + rejected	Security — you need denied flows
`ACCEPT`	Accepted only	Cost accounting; loses scan signal
`REJECT`	Rejected only	Cheap scan/attack surface monitoring

AWS delivery destinations: S3, CloudWatch Logs, and Firehose

Where the AWS logs go determines your query economics and latency. S3 is the default for a detection pipeline: cheapest storage, Parquet, Athena, Hive partitions — near-real-time (records arrive within minutes) but not streaming. CloudWatch Logs gives lower-latency delivery, Logs Insights queries, and metric filters that can alarm directly, but is markedly dearer per GB, so it suits a few high-urgency subnets, not the whole fleet. Amazon Data Firehose (formerly Kinesis Data Firehose) streams to a downstream consumer — OpenSearch, Splunk, or an S3 lake with transformation — for teams wanting flow logs inside an existing SIEM in near real time.

Destination	Latency	Query surface	Cost profile	Best for
S3 (Parquet)	Minutes	Athena (partition projection)	Cheapest at scale	Fleet-wide detection lake
S3 (plain text)	Minutes	Athena (full scans hurt)	Cheap storage, dear queries	Avoid — always use Parquet
CloudWatch Logs	~1 min	Logs Insights, metric filters, alarms	Dear per-GB ingest + storage	Low-volume, high-urgency subnets
Amazon Data Firehose	Seconds–minutes	Downstream (OpenSearch/Splunk/S3)	Streaming cost + destination	SIEM integration, real-time

A common production pattern is both: S3 for the cheap, queryable, long-retention detection lake, plus CloudWatch Logs or Firehose on a small number of critical subnets where you need real-time metric-filter alarms. You can create multiple flow logs on the same resource with different destinations.

CloudWatch metric-filter alarm on rejected flows, for the real-time slice:

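# Metric filter: emit 1 per REJECT record into the RejectedFlows metric; attach a CloudWatch alarm to that metric to page on a spike.
# The pattern names the 14 default-format fields in order; a custom-format log group needs one name per field it emits.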
aws logs put-metric-filter \
  --log-group-name /vpc/flowlogs/critical \
  --filter-name reject-count \
  --filter-pattern '[version, account, eni, src, dst, srcport, dstport, protocol, packets, bytes, start, end, action=REJECT, status]' \
  --metric-transformations \
      metricName=RejectedFlows,metricNamespace=FlowLogs,metricValue=1

AWS query layer: partitioning and partition projection

The single biggest AWS cost mistake is letting Athena scan the whole bucket. With HiveCompatiblePartitions=true, logs land under key=value prefixes: .../AWSLogs/aws-account-id=<account>/aws-service=vpcflowlogs/aws-region=<region>/year=2026/month=06/day=08/hour=14/. You create the table as partitioned and use partition projection so Athena computes partition locations from the query predicate: no MSCK REPAIR TABLE, no Glue crawler, no stale partitions. The storage.location.template must reproduce that path exactly, including the account ID (111122223333 below is a placeholder for your own); files under the hour= prefixes are read as part of their day partition.

CREATE EXTERNAL TABLE vpc_flow_logs (
  version int, account_id string, interface_id string,
  srcaddr string, dstaddr string, srcport int, dstport int,
  protocol bigint, packets bigint, bytes bigint,
  start_ts bigint, end_ts bigint, action string, log_status string,
  flow_direction string, pkt_srcaddr string, pkt_dstaddr string,
  tcp_flags int, traffic_path int
)
PARTITIONED BY (`region` string, `year` int, `month` int, `day` int)
STORED AS PARQUET
LOCATION 's3://netflow-logs-prod/vpc/AWSLogs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us-east-1,eu-west-1',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2026,2035',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' =
    's3://netflow-logs-prod/vpc/AWSLogs/aws-account-id=111122223333/aws-service=vpcflowlogs/aws-region=${region}/year=${year}/month=${month}/day=${day}'
);

Every detection query then filters on year, month and day (and optionally region), and Athena scans only those partitions. A query that previously scanned 400 GB scans a single day’s few GB.

The partition-projection knobs, and when each matters:

Property	What it does	Typical value	Gotcha
`projection.enabled`	Turns projection on	`true`	Without it, Athena needs cataloged partitions
`projection.<col>.type`	Projection type	`date`, `enum`, `integer`	Type must match the column semantics
`projection.<col>.range`	Bounds of an integer or date projection	`1,12` for `month`	Too wide a range = wasted partition enumeration
`projection.<col>.digits`	Zero-pads integer values in the path	`2`	Without it `month=6` never matches the `month=06` prefix
`storage.location.template`	Maps partition values → S3 path	`s3://…/aws-region=${region}/year=${year}/month=${month}/day=${day}`	A wrong template silently returns zero rows

Cost-shaping choices at the query layer:

Lever	Effect on Athena cost	Effort	Notes
Parquet vs text	5–20× less scanned	Set at enable time	Columnar + compression
Partition projection	Scan one day, not all history	One-time table DDL	No crawler needed
`SELECT` only needed columns	Less columnar scan	Per query	Never `SELECT *` on a lake
Compression (Snappy/ZSTD)	Smaller files	Delivery option	ZSTD denser, Snappy faster
CTAS to a rollup table	Pre-aggregate hot detections	Scheduled	Cheap repeated dashboards
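
The effect of the first two levers is simple arithmetic on bytes scanned. The sketch below assumes an illustrative rate of 5 USD per TB scanned (check current Athena pricing for your region) and takes "a few GB" as 4 GB; both figures are assumptions, not measurements.

```python
USD_PER_TB = 5.0  # assumption: illustrative rate, check current Athena pricing

def scan_cost_usd(gb_scanned, usd_per_tb=USD_PER_TB):
    """Cost of one query that scans gb_scanned gigabytes (1 TB = 1024 GB here)."""
    return gb_scanned / 1024 * usd_per_tb

full_history = scan_cost_usd(400)   # unpartitioned table, whole bucket
single_day = scan_cost_usd(4)       # assumption: "a few GB" taken as 4 GB
saving = 1 - single_day / full_history

print(f"full: ${full_history:.2f}  one day: ${single_day:.4f}  saving: {saving:.0%}")
```

A single query is cheap either way; the bill comes from analysts and schedules repeating the full scan hundreds of times a day.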

Azure Traffic Analytics: enrichment and the processing model

This is where Azure pulls ahead. Raw VNet flow logs are just tuples in blob; Traffic Analytics processes them on the interval you set and writes enriched rows into NTANetAnalytics — the enrichment you would otherwise build by hand on AWS:

Enrichment	What it gives you	AWS equivalent effort
Geo-IP	`Country`/`Region` on public endpoints	Join a MaxMind/IP2Location table in Athena
Flow classification	`FlowType` (intra/inter-VNet, S2S, external, malicious)	Regex classify src/dst against your CIDR inventory
Rule resolution	`AclGroup`/`AclRule` that decided	Not recoverable from VPC flow logs
Resource resolution	`SrcVm`/`DestVm`, `SrcSubnet`/`DestSubnet`	Join ENI → instance → subnet via describe APIs
Threat-intel match	`MaliciousFlow`, `*ThreatIntel`	Join against your own intel feed
Topology / geomap	Built-in map + topology in the workbook	Build in QuickSight/Grafana

The FlowType values and what they mean for detection targeting:

`FlowType`	Meaning	Detection relevance
`IntraVNet`	Both endpoints in the same VNet	East-west lateral movement
`InterVNet`	Across peered VNets	Cross-tier / cross-app lateral movement
`S2S`	Site-to-site (on-prem via VPN/ER)	Hybrid egress/ingress review
`P2S`	Point-to-site (VPN clients)	Remote-access anomaly
`ExternalPublic`	One endpoint is a public internet IP	Exfil, C2, external egress
`AzurePublic`	Traffic to Azure public service IPs	Baseline; usually benign
`MaliciousFlow`	Matched a known-bad IP	Highest signal — triage first
`Unknown` / `UnknownPrivate`	Could not classify	Investigate if volume is high
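
On AWS the same classification has to be derived from your own CIDR inventory. A minimal Python sketch of that logic follows, using the standard `ipaddress` module instead of regexes; the inventory, the VPC names and the reuse of the FlowType labels are all illustrative assumptions.

```python
import ipaddress

# Hypothetical inventory: the CIDR of each VPC you own.
VPC_CIDRS = {
    "vpc-app": ipaddress.ip_network("10.1.0.0/16"),
    "vpc-data": ipaddress.ip_network("10.2.0.0/16"),
}

def owning_vpc(ip):
    """Return the inventory name whose CIDR contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, network in VPC_CIDRS.items():
        if addr in network:
            return name
    return None

def classify_flow(src, dst):
    """Label a flow with FlowType-style names borrowed from Traffic Analytics."""
    src_vpc, dst_vpc = owning_vpc(src), owning_vpc(dst)
    if src_vpc and dst_vpc:
        return "IntraVNet" if src_vpc == dst_vpc else "InterVNet"
    if not src_vpc and not dst_vpc:
        return "Unknown"
    outside = dst if src_vpc else src
    if ipaddress.ip_address(outside).is_private:
        return "UnknownPrivate"
    return "ExternalPublic"

labels = [classify_flow("10.1.1.5", d)
          for d in ("10.1.3.9", "10.2.0.7", "8.8.8.8", "192.168.5.5")]
print(labels)
```

In a pipeline this would typically run as a lookup-table join in Athena; the Python form is only meant to pin down the decision order.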

The geomap in the Traffic Analytics workbook plots ExternalPublic flows by Country, which is the fastest visual way to spot egress to a geography you do not operate in — the same signal the country-egress KQL below produces as a table.

A quick sanity query confirming Traffic Analytics is enriching, broken down by classification:

NTANetAnalytics
| where TimeGenerated > ago(2h)
| summarize Records = count(),
            Denied = countif(FlowStatus == "Denied"),
            Malicious = countif(FlowType == "MaliciousFlow")
    by FlowType
| order by Records desc

A healthy result shows records across multiple FlowType values. If NTANetAnalytics is empty but the storage account has blobs, Traffic Analytics is disabled or pointed at the wrong workspace.

The NSG-flow-logs to VNet-flow-logs deprecation

This is a migration you cannot ignore. Microsoft is retiring NSG flow logs in two stages: since 30 June 2025 new NSG flow logs can no longer be created, and on 30 September 2027 the feature is retired outright, with the guidance to move all collection to VNet flow logs before then. Traffic Analytics on NSG flow logs is superseded by Traffic Analytics on VNet flow logs writing to NTANetAnalytics (the older AzureNetworkAnalytics_CL table is on the same retirement track). If you still rely on NSG flow logs, any NSG or workload added since mid-2025 has no flow log unless a VNet flow log covers it, and that is a blind spot in your detections.

Why VNet flow logs are strictly better, not just newer:

Dimension	NSG flow log (legacy)	VNet flow log (go-forward)
Attachment point	An NSG	A VNet / subnet / NIC
Coverage	Only flows crossing that NSG	All flows in scope, NSG or not
Survives NSG reassignment	No — log breaks	Yes — follows the VNet
Flow state (B/C/E)	Limited	Full begin/continue/end
Encryption field	No	Yes (`X`/`NX`/`NA`)
New logs can be created (since 30 June 2025)	No	Yes
Traffic Analytics table	`AzureNetworkAnalytics_CL` (retiring)	`NTANetAnalytics`
Redundant NSG-pair logging	Common (inbound+outbound NSG both log)	Eliminated — one VNet log

The migration steps, in order:

#	Step	Command / action	Verify
1	Inventory existing NSG flow logs	`az network watcher flow-log list --location <region> --query "[?contains(targetResourceId, 'networkSecurityGroups')]"`	List of NSG-scoped logs
2	Create VNet flow log per VNet	`az network watcher flow-log create --vnet …`	New log enabled
3	Enable Traffic Analytics on the VNet log	`--traffic-analytics true --workspace …`	Rows in `NTANetAnalytics`
4	Repoint detections to `NTANetAnalytics`	Update saved KQL / alert rules	Queries return rows
5	Confirm parity for a known flow	Generate a test flow, find it in both	Present in new table
6	Delete the NSG flow logs	`az network watcher flow-log delete …`	Old logs removed
7	Update IaC to VNet-scope only	Bicep/Terraform PR	No NSG-flow-log resources remain

Do not skip step 5. Deleting the old logs before confirming the new ones enrich correctly is how teams create a multi-day gap in their detection coverage.

Building the detection pipeline: the query patterns

With collection and enrichment in place, the pipeline is the set of detections — each a pattern over the 5-tuple, verdict, counters and time. Here is the ruleset (KQL for Azure NTANetAnalytics, Athena SQL for AWS), organised by attack class.

Denied-flow analysis

The workhorse first query. On Azure you also get the rule that denied, which AWS cannot give you. Denied flows grouped by rule — find which NSG rule is dropping traffic and what is hitting it:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowStatus == "Denied"
| summarize Flows = count(),
            Bytes = sum(BytesSrcToDest + BytesDestToSrc),
            Ports = dcount(DestPort),
            Sources = dcount(SrcIp)
    by AclGroup, AclRule
| top 50 by Flows desc

The AWS equivalent — top denied offenders (no rule name available, so key off the tuple):

SELECT srcaddr, dstaddr, dstport,
       COUNT(*) AS rejects, SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
  AND action = 'REJECT'
GROUP BY srcaddr, dstaddr, dstport
ORDER BY rejects DESC
LIMIT 50;

Denied-flow spike (per rule / per source)

A sudden surge in denials against one rule means scanning, a misconfigured client, or someone probing your policy. Alert on the rate, not individual flows:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowStatus == "Denied"
| summarize Denials = count() by SrcIp, bin(TimeGenerated, 5m)
| where Denials > 200            // tune to your baseline
| summarize PeakDenials = max(Denials), Windows = count() by SrcIp
| order by PeakDenials desc
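
The same rate logic, sketched in Python for readers porting the rule to another SIEM. It assumes denied flows arrive as `(src_ip, epoch_seconds)` pairs; `denial_spikes` and the sample addresses are hypothetical.

```python
from collections import Counter

def denial_spikes(denials, window_sec=300, threshold=200):
    """Return {src_ip: peak denials in any one window} for sources over threshold."""
    per_window = Counter()
    for src, ts in denials:
        window_start = ts - ts % window_sec   # floor to the 5-minute bin
        per_window[(src, window_start)] += 1
    peaks = {}
    for (src, _), count in per_window.items():
        if count > threshold:
            peaks[src] = max(peaks.get(src, 0), count)
    return peaks

# 250 denials from one source inside a single window, 50 from a quiet client.
events = [("203.0.113.7", t) for t in range(250)]
events += [("10.0.0.5", t) for t in range(50)]
spikes = denial_spikes(events)
print(spikes)
```

Only the source that crosses the per-window threshold is reported, which is what keeps one scan to one alert.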

Top talkers

Resolve the noisiest source/destination pairs — the baseline you build every other detection against:

NTANetAnalytics
| where TimeGenerated > ago(24h)
| where FlowStatus == "Allowed"
| extend TotalBytes = BytesSrcToDest + BytesDestToSrc
| summarize Bytes = sum(TotalBytes), Flows = count()
    by SrcIp, DestIp, DestPort
| top 25 by Bytes desc

AWS, using pkt_dstaddr to see the real destination behind a NAT gateway rather than the NAT’s own address:

SELECT srcaddr, pkt_dstaddr AS real_dst, dstport,
       SUM(bytes) AS total_bytes, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
  AND action = 'ACCEPT'
  AND flow_direction = 'egress'
GROUP BY srcaddr, pkt_dstaddr, dstport
ORDER BY total_bytes DESC
LIMIT 25;

Data exfiltration — the byte-ratio tell

Exfiltration looks like sustained, asymmetric outbound volume to one external destination. The tell is the byte ratio: far more bytes leaving than arriving. In KQL:

NTANetAnalytics
| where TimeGenerated > ago(6h)
| where FlowType == "ExternalPublic" and FlowStatus == "Allowed"
| summarize Out = sum(BytesSrcToDest), In = sum(BytesDestToSrc)
    by SrcIp, DestIp, Country
| extend Ratio = todouble(Out) / (In + 1)
| where Out > 50000000 and Ratio > 20     // >50 MB out, 20:1 outbound skew
| order by Out desc

The AWS version keys off egress direction to non-RFC1918 destinations. Because VPC flow logs do not split in/out bytes on one row, you compute the ratio by joining egress and ingress aggregates per pair, or approximate with egress volume alone plus a destination-port heuristic:

-- Egress volume to public destinations on non-web ports (exfil shape)
SELECT srcaddr, pkt_dstaddr, dstport,
       SUM(bytes) AS bytes_out, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day >= date_format(current_date - interval '1' day, '%Y/%m/%d')
  AND action = 'ACCEPT' AND flow_direction = 'egress'
  AND NOT regexp_like(pkt_dstaddr,
        '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.)')
  AND dstport NOT IN (443, 80)        -- the long tail past normal web egress
GROUP BY srcaddr, pkt_dstaddr, dstport
HAVING SUM(bytes) > 50000000
ORDER BY bytes_out DESC;
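
The join-based ratio mentioned above can be sketched as follows. It assumes you have already aggregated egress and ingress bytes into two mappings keyed by `(internal_ip, external_ip)`, with the ingress rows flipped so both sides share the same key; the function name and sample values are hypothetical.

```python
def exfil_candidates(egress, ingress, min_out=50_000_000, min_ratio=20):
    """Pairs with more than min_out bytes outbound and an out:in skew above min_ratio."""
    hits = []
    for pair, out_bytes in egress.items():
        in_bytes = ingress.get(pair, 0)
        ratio = out_bytes / (in_bytes + 1)    # +1 avoids division by zero
        if out_bytes > min_out and ratio > min_ratio:
            hits.append((pair, out_bytes, ratio))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

egress = {
    ("10.1.1.5", "198.51.100.9"): 80_000_000,   # heavy upload, little coming back
    ("10.1.1.6", "198.51.100.10"): 60_000_000,  # heavy but roughly symmetric
}
ingress = {
    ("10.1.1.5", "198.51.100.9"): 1_000_000,
    ("10.1.1.6", "198.51.100.10"): 30_000_000,
}
candidates = exfil_candidates(egress, ingress)
print([pair for pair, _, _ in candidates])
```

The thresholds mirror the KQL version (more than 50 MB out, 20:1 skew) and are starting points to tune, not fixed values.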

Unexpected egress to unapproved geographies

Egress to public IPs in countries you do not operate in — the compliance and exfil detection combined:

NTANetAnalytics
| where TimeGenerated > ago(24h)
| where FlowType == "ExternalPublic" and FlowStatus == "Allowed"
| where Country !in ("US", "IE", "")   // your operating geos, as two-letter ISO country codes
| summarize Bytes = sum(BytesSrcToDest), Flows = count()
    by SrcIp, DestIp, Country, DestPort
| where Bytes > 10000000                                  // > ~10 MB
| order by Bytes desc

Port scan — one source, many destination ports

A scanner enumerates services on a target: one source IP hitting many destination ports, mostly denied (for a service that is not there) or with SYN-only tcp-flags. In KQL:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| summarize DistinctPorts = dcount(DestPort),
            Denied = countif(FlowStatus == "Denied"),
            Flows = count()
    by SrcIp, DestIp, bin(TimeGenerated, 10m)
| where DistinctPorts > 50 and Denied > 30    // many ports, mostly refused
| order by DistinctPorts desc

AWS, adding the SYN-only tcp-flags refinement (tcp_flags = 2 is SYN with no ACK — the half-open scan fingerprint):

SELECT srcaddr, dstaddr,
       COUNT(DISTINCT dstport) AS distinct_ports,
       COUNT(*) AS flows,
       SUM(CASE WHEN action = 'REJECT' THEN 1 ELSE 0 END) AS rejects
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
  AND protocol = 6                         -- TCP
  AND tcp_flags = 2                        -- SYN only (no ACK) = half-open scan
GROUP BY srcaddr, dstaddr
HAVING COUNT(DISTINCT dstport) > 50
ORDER BY distinct_ports DESC;
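
The `tcp_flags` value is a bitmask of the standard TCP header flags, OR-ed together over the aggregation interval, which is why an exact match on 2 isolates flows where only a SYN was seen. A small Python decoder makes the values readable during triage; the helper names are hypothetical.

```python
# Standard TCP header flag bits.
TCP_FLAG_BITS = {"FIN": 1, "SYN": 2, "RST": 4, "PSH": 8, "ACK": 16, "URG": 32}

def decode_tcp_flags(value):
    """Return the names of the flags set in a tcp-flags bitmask."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]

def is_syn_only(value):
    """True when the only flag observed in the window was SYN."""
    return value == TCP_FLAG_BITS["SYN"]

for sample in (2, 18, 19):
    print(sample, decode_tcp_flags(sample), is_syn_only(sample))
```

A value of 18 (SYN plus ACK) is the responder side of a normal handshake, so it should not be counted as scan evidence.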

Host sweep — one source, one port, many hosts

The inverse of a port scan: the attacker knows the service and looks for hosts running it (e.g. sweeping the subnet for open 445/SMB or 22/SSH):

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where DestPort in (22, 445, 3389, 5985, 1433, 3306)   // services worth sweeping
| summarize DistinctHosts = dcount(DestIp), Flows = count()
    by SrcIp, DestPort, bin(TimeGenerated, 10m)
| where DistinctHosts > 25
| order by DistinctHosts desc

The AWS equivalent:

SELECT srcaddr, dstport,
       COUNT(DISTINCT dstaddr) AS distinct_hosts, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
  AND dstport IN (22, 445, 3389, 5985, 1433, 3306)
  AND protocol = 6
GROUP BY srcaddr, dstport
HAVING COUNT(DISTINCT dstaddr) > 25
ORDER BY distinct_hosts DESC;

C2 beaconing — periodic, low-jitter heartbeats

Command-and-control beacons are regular: a small flow to the same destination at a near-constant interval. Flow logs cannot see the payload, but they can see the cadence. The detection computes the standard deviation of inter-arrival times per (source, destination) pair — a low stddev relative to the mean is the beaconing fingerprint. In KQL:

NTANetAnalytics
| where TimeGenerated > ago(24h)
| where FlowType == "ExternalPublic"
| project TimeGenerated, SrcIp, DestIp, DestPort,
          Bytes = BytesSrcToDest + BytesDestToSrc
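| order by SrcIp asc, DestIp asc, DestPort asc, TimeGenerated asc
| serialize
| extend PrevTime = prev(TimeGenerated),
         PrevKey = strcat(prev(SrcIp), "|", prev(DestIp), "|", prev(DestPort))
| where strcat(SrcIp, "|", DestIp, "|", DestPort) == PrevKey   // same src, dst and port as the previous row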
| extend GapSec = datetime_diff('second', TimeGenerated, PrevTime)
| summarize Beacons = count(), AvgGap = avg(GapSec),
            StdGap = stdev(GapSec), AvgBytes = avg(Bytes)
    by SrcIp, DestIp, DestPort
| where Beacons > 20 and AvgGap > 30 and StdGap < AvgGap * 0.2  // regular + low jitter
| order by StdGap asc

The AWS equivalent using window functions over ordered flow start times:

WITH ordered AS (
  SELECT srcaddr, pkt_dstaddr, dstport, start_ts, bytes,
         LAG(start_ts) OVER (
           PARTITION BY srcaddr, pkt_dstaddr, dstport ORDER BY start_ts
         ) AS prev_ts
  FROM vpc_flow_logs
  WHERE day = date_format(current_date, '%Y/%m/%d')
    AND action = 'ACCEPT' AND flow_direction = 'egress'
    AND NOT regexp_like(pkt_dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
),
gaps AS (
  SELECT srcaddr, pkt_dstaddr, dstport, bytes,
         (start_ts - prev_ts) AS gap
  FROM ordered WHERE prev_ts IS NOT NULL
)
SELECT srcaddr, pkt_dstaddr, dstport,
       COUNT(*) AS beacons,
       AVG(gap) AS avg_gap,
       STDDEV(gap) AS std_gap,
       AVG(bytes) AS avg_bytes
FROM gaps
GROUP BY srcaddr, pkt_dstaddr, dstport
HAVING COUNT(*) > 20 AND AVG(gap) > 30 AND STDDEV(gap) < AVG(gap) * 0.2
ORDER BY std_gap ASC;
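
The cadence test itself is small enough to state in a few lines of Python. This sketch assumes you have the flow start times (epoch seconds) for one source, destination and port; `beacon_score` is a hypothetical name and the defaults mirror the thresholds in the queries above.

```python
from statistics import mean, stdev

def beacon_score(timestamps, min_flows=20, min_gap=30, max_jitter=0.2):
    """Return gap statistics if the series looks like a beacon, else None."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) <= min_flows:
        return None                      # too few observations to judge cadence
    avg_gap, std_gap = mean(gaps), stdev(gaps)
    if avg_gap > min_gap and std_gap < avg_gap * max_jitter:
        return {"beacons": len(gaps), "avg_gap": avg_gap, "std_gap": std_gap}
    return None

# A heartbeat every 60 seconds versus bursty, human-like traffic.
regular = [i * 60 for i in range(30)]
irregular = [0]
for i in range(29):
    irregular.append(irregular[-1] + (10 if i % 2 == 0 else 300))

print(beacon_score(regular))
print(beacon_score(irregular))
```

Real implants add jitter on purpose, so treat the 20% bound as a tunable starting point and pair it with the small-bytes signal.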

East-west surprise — lateral movement across tiers

A workload talking laterally to a tier it has no business reaching — a frontend connecting straight to the database subnet, bypassing the app tier. With FlowType == "IntraVNet"/"InterVNet" you baseline expected tier-to-tier pairs and alert on any new one. The Athena version keys off your subnet CIDRs directly:

-- Frontend subnet (10.1.1.0/24) reaching DB subnet (10.1.3.0/24) directly
SELECT srcaddr, dstaddr, dstport, COUNT(*) AS flows, SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
  AND action = 'ACCEPT'
  AND regexp_like(srcaddr, '^10\.1\.1\.')
  AND regexp_like(dstaddr, '^10\.1\.3\.')
GROUP BY srcaddr, dstaddr, dstport;
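
The baseline-and-alert idea generalises beyond one hard-coded pair. A minimal Python sketch, assuming a three-tier layout: the frontend and DB subnets come from the query above, while the app subnet (10.1.2.0/24) and the allowed-pair list are hypothetical.

```python
import ipaddress

TIERS = {
    "frontend": ipaddress.ip_network("10.1.1.0/24"),
    "app": ipaddress.ip_network("10.1.2.0/24"),      # hypothetical app subnet
    "db": ipaddress.ip_network("10.1.3.0/24"),
}
# Hypothetical baseline of tier-to-tier pairs that policy expects.
ALLOWED_PAIRS = {("frontend", "app"), ("app", "db")}

def tier_of(ip):
    """Return the tier whose subnet contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    return next((name for name, net in TIERS.items() if addr in net), None)

def unexpected_pairs(flows):
    """Flows that cross tiers along a path the baseline does not allow."""
    hits = []
    for src, dst in flows:
        pair = (tier_of(src), tier_of(dst))
        if None not in pair and pair[0] != pair[1] and pair not in ALLOWED_PAIRS:
            hits.append((src, dst, pair))
    return hits

flows = [("10.1.1.10", "10.1.2.20"), ("10.1.2.20", "10.1.3.30"), ("10.1.1.10", "10.1.3.30")]
surprises = unexpected_pairs(flows)
print(surprises)
```

Only the frontend-to-DB flow is reported; the two hops that follow the intended path stay silent.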

Rule miss — an ACCEPT that policy forbids

The highest-value config detection: a rule you think is blocking traffic but is not. If policy says no host should reach the metadata service or an admin port from the workload tier, a single ACCEPT is a finding:

-- Any ACCEPT to a port that policy says should never be reachable
SELECT srcaddr, pkt_dstaddr, dstport, SUM(bytes) AS bytes, COUNT(*) AS flows
FROM vpc_flow_logs
WHERE day = date_format(current_date, '%Y/%m/%d')
  AND action = 'ACCEPT'
  AND dstport IN (22, 3389, 2375, 6443, 10250)   -- SSH, RDP, docker, kube-api, kubelet
GROUP BY srcaddr, pkt_dstaddr, dstport;

The Azure equivalent, which also returns the rule that allowed the flow:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowStatus == "Allowed"
| where DestPort in (22, 3389, 2375, 6443, 10250)
| summarize Bytes = sum(BytesSrcToDest + BytesDestToSrc), Flows = count()
    by SrcIp, DestIp, DestPort, AclRule
| order by Flows desc

The detection ruleset as a reference

Every detection above, its thresholds, the flow-log fields it depends on, and the MITRE ATT&CK technique it maps to:

Detection	Key fields	Threshold (starting point)	MITRE technique
Denied-flow spike	verdict, src, time	> 200 denials / 5 min / src	T1046 Network Service Discovery
Exfil (byte ratio)	out bytes, in bytes, ext	> 50 MB out, ratio > 20:1	T1041 Exfiltration Over C2 Channel
Exfil (geo egress)	ext bytes, `Country`	> 10 MB to unapproved geo	T1567 Exfiltration Over Web Service
Port scan	dst port, verdict, flags	> 50 ports, mostly denied / SYN-only	T1046 Network Service Discovery
Host sweep	dst IP, one port	> 25 hosts, one service port	T1046 Network Service Discovery
C2 beaconing	time cadence, small bytes	> 20 flows, jitter < 20% of mean	T1071 Application Layer Protocol
East-west surprise	src/dst CIDR pair	any flow not in tier baseline	T1021 Remote Services / lateral
Rule miss (ACCEPT)	verdict, dst port	any ACCEPT to forbidden port	T1190 / control validation
Malicious-IP match (Azure)	`FlowType`/`MaliciousFlow`	any `MaliciousFlow`	T1071 / threat-intel match

Which fields each detection requires, so you know what your collection format must capture:

Detection	Needs `tcp-flags`	Needs `flow-direction`	Needs `pkt-dstaddr`	Needs 1-min interval
Denied-flow spike	No	No	No	Helps
Exfil (ratio/geo)	No	Yes	Yes	No
Port scan	Yes (SYN refinement)	No	No	Yes
Host sweep	No	No	No	Yes
C2 beaconing	No	Yes	Yes	Yes
East-west surprise	No	No	No	No
Rule miss	No	No	Yes	No

Dashboards and alerting without alert fatigue

The trap is alerting on raw flow counts. Flow volume is noisy and self-similar; a threshold that catches a real incident on Tuesday fires twenty times during Monday’s batch job. Alert on derived, baselined metrics.

On Azure, schedule the detection KQL as a Log Analytics alert rule and require both a volume floor and a deviation from baseline, so normal traffic never trips it:

let lookback = 7d;
let baseline = toscalar(
    NTANetAnalytics
    | where TimeGenerated between (ago(lookback) .. ago(1d))
    | where FlowType == "ExternalPublic"
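    | summarize HourlyOut = sum(BytesSrcToDest) by SrcIp, bin(TimeGenerated, 1h)
    | summarize avg(HourlyOut)   // average hourly outbound bytes per source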
);
NTANetAnalytics
| where TimeGenerated > ago(1h)
| where FlowType == "ExternalPublic"
| summarize Out = sum(BytesSrcToDest) by SrcIp
| where Out > baseline * 10 and Out > 100000000   // 10× baseline AND >100 MB
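
The gate is a conjunction of two independent conditions, which is easy to lose in a long query. As a minimal Python sketch (function name and sample figures are hypothetical; the defaults mirror the KQL above):

```python
def should_alert(out_bytes, baseline_bytes, multiplier=10, floor_bytes=100_000_000):
    """Alert only when volume is both far above baseline and above an absolute floor."""
    deviates = out_bytes > baseline_bytes * multiplier
    above_floor = out_bytes > floor_bytes
    return deviates and above_floor

print(should_alert(500_000_000, 20_000_000))  # 25x baseline and 500 MB
print(should_alert(50_000_000, 1_000_000))    # 50x baseline but only 50 MB
print(should_alert(150_000_000, 40_000_000))  # 150 MB but under 4x baseline
```

The second case is the quiet host that suddenly sends a little; the third is the batch job that always sends a lot. Neither pages anyone.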

The noise-control rules that separate a usable pipeline from a muted one:

Rule	Instead of	Why
Alert on denied-flow rate per rule	Alert per denied flow	One scan = one alert, not thousands
Alert on new external destinations	Alert on volume to known ones	Known CDNs never re-page
Gate on baseline deviation + volume floor	A fixed byte threshold	Batch jobs never trip it
Make geo/exfil detections stateful	Re-fire on every run	Suppress once triaged
Bin by 5–10 min windows	Per-flow evaluation	Smooths self-similar noise
Exclude known-good pairs by allow-list	Manual triage each time	Cuts recurring false positives
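
The "stateful" rule in the table above amounts to remembering which findings have already been raised. A minimal Python sketch, assuming findings are dictionaries with `detection`, `src` and `dst` keys and that the seen-set would be persisted between runs in practice (here it is just an in-memory set):

```python
def fresh_findings(findings, seen):
    """Return findings whose key is not yet in `seen`, and record them as seen."""
    fresh = []
    for finding in findings:
        key = (finding["detection"], finding["src"], finding["dst"])
        if key not in seen:
            seen.add(key)
            fresh.append(finding)
    return fresh

seen_keys = set()
run_1 = [{"detection": "geo-egress", "src": "10.1.1.5", "dst": "198.51.100.9"}]
run_2 = run_1 + [{"detection": "geo-egress", "src": "10.1.1.6", "dst": "203.0.113.20"}]

first = fresh_findings(run_1, seen_keys)
second = fresh_findings(run_2, seen_keys)
print(len(first), [f["src"] for f in second])
```

The second run reports only the new pair, so a triaged finding does not re-page on every schedule tick.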

The three panels that answer 90% of investigations — build the dashboard (Azure Workbook or a Grafana panel over Athena) around these:

Panel	Query basis	Answers
Top talkers	Bytes by src/dst/port	“What is moving the most data?”
Denied-by-rule	Denials by `AclRule` / tuple	“What is being blocked, and is it a scan?”
External egress by country	`ExternalPublic` bytes by `Country`	“Are we egressing where we shouldn’t?”

Architecture at a glance

Picture the pipeline as three horizontal bands: collection, enrichment/storage, and detection — with the two clouds running parallel tracks that converge at the SIEM.

On the Azure track, traffic inside a VNet is captured by a VNet flow log (not an NSG flow log: that path is deprecated, closed to new logs since 30 June 2025 and due for retirement on 30 September 2027). Raw comma-delimited tuples land in a blob storage account with short retention (7 days is plenty). Traffic Analytics reads those blobs on a 10-minute processing interval and enriches them with geo-IP by country, FlowType classification, the deciding NSG rule, VM/subnet resolution, and a match against Microsoft’s known-malicious-IP feed, then writes enriched rows into the NTANetAnalytics table in a Log Analytics workspace. Your KQL detections and alert rules run there.

On the AWS track, traffic across a VPC is captured by a VPC flow log with a custom format (adding the three fields the default drops: tcp-flags, flow-direction, pkt-dstaddr) at a 1-minute interval. Records land in S3 as Parquet under Hive-compatible year=/month=/day=/hour= partitions. Athena, backed by an external table with partition projection, queries only the partitions a detection touches — so a query scans one day, not the whole bucket. A small, critical slice of subnets also streams to CloudWatch Logs or Amazon Data Firehose for real-time metric-filter alarms. The enrichment Traffic Analytics does for free — geo, classification, threat-intel — you build here with lookup-table joins and regex classification.

Both tracks converge at the detection ruleset and SIEM: the KQL and Athena queries for exfil, port scans, host sweeps, C2 beaconing, denied-flow spikes and east-west surprises run on a schedule, gated by baseline-deviation-plus-volume-floor, and every finding feeds back — the closing arrow of the whole diagram — into an NSG or security-group rule change in the IaC repo, where the same detection query becomes the regression test. Read it left to right on each track (capture → store/enrich → detect) and top to bottom at the end (detect → alert → change rule → re-verify): that loop is the entire point of the system.

Real-world scenario

Meridian Pay, a fictional but representative fintech platform team, ran about 80 VPCs across three AWS accounts behind a centralized inspection VPC, plus a smaller Azure footprint for their identity workloads. They had two problems that turned out to share one root cause: a PCI-DSS and data-residency requirement to prove no production workload egressed to a non-approved country, and an Athena bill that had crept to four figures a month because every analyst ran SELECT * ... WHERE against an unpartitioned, plain-text flow-log table that scanned the entire bucket on every query. The security lead’s blunt summary: “We have all the data and can’t afford to look at it, and we can’t prove the thing the auditor asks.”

The diagnosis was quick. A single “which hosts talked to this IP last week” query scanned 400 GB — because the table had no partitions and the logs were text, so Athena read every byte of history for a one-day question. And the country-egress evidence the auditor wanted did not exist as a repeatable artefact; someone produced it by hand each quarter, badly.

Two changes fixed both. First, they re-created the flow logs in Parquet with Hive-compatible hourly partitions and rebuilt the Athena table with partition projection scoped to region and day. A query that previously scanned 400 GB now scanned a single day’s few GB, dropping per-query cost by over 95%; switching analysts off SELECT * onto column lists shaved it further. Second, they codified the country-egress check as a scheduled Athena query writing results to a compliance bucket, with destination IPs run through a geo lookup joined in the query — turning a manual quarterly scramble into a daily, auditable artefact.

The detection that mattered fired within a week. A batch job in a staging account had a NAT route that let it reach a package mirror in a region outside their approved list. Flow logs caught it because the egress showed up as an ExternalPublic-shaped flow to an unexpected country, and — critically — pkt_dstaddr revealed the real mirror behind the NAT gateway, not the NAT’s own address. Without that custom-format field, the report would have shown the NAT IP and the geo lookup would have resolved to the NAT’s region, hiding the finding entirely. Remediation was a two-line staging-egress route change plus an FQDN allow-list entry for the approved mirror, both landed in the same Terraform module as the flow-log config — and the compliance query became the regression test.

The scheduled compliance query, keyed off the approved-CIDR exclusion they maintained as a list:

-- Daily compliance: egress to any public dst outside approved ranges
SELECT account_id, srcaddr, pkt_dstaddr, dstport, SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE day = date_format(current_date - interval '1' day, '%Y/%m/%d')
  AND action = 'ACCEPT' AND flow_direction = 'egress'
  AND NOT regexp_like(pkt_dstaddr,
        '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.)')
GROUP BY account_id, srcaddr, pkt_dstaddr, dstport
HAVING SUM(bytes) > 1000000
ORDER BY bytes DESC;

The RFC1918-plus-CGNAT exclusion regex is what makes the report trustworthy: it keeps internal and carrier-grade-NAT ranges out of the “external egress” result so the artefact shows only genuinely public destinations. The three-line lesson they wrote on the wall: “Store it so you can afford to query it. Capture the real destination, not the NAT’s. A finding that doesn’t change a route is just a log line.”
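
Because the whole report rests on that regex, it is worth checking it against real CIDR arithmetic at the range boundaries. The Python sketch below compares the regex from the query with the standard `ipaddress` module; the sample list and helper names are assumptions chosen to probe each edge.

```python
import ipaddress
import re

# The exclusion pattern from the compliance query.
INTERNAL_RE = re.compile(
    r"^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\."
    r"|100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.)"
)
# RFC1918 ranges plus the carrier-grade NAT range 100.64.0.0/10.
INTERNAL_NETS = [ipaddress.ip_network(cidr) for cidr in
                 ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")]

def regex_says_internal(ip):
    return bool(INTERNAL_RE.match(ip))

def cidr_says_internal(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETS)

# One address just inside and just outside each range.
SAMPLES = ["10.0.0.1", "172.15.255.255", "172.16.0.1", "172.31.255.254",
           "172.32.0.1", "192.168.1.1", "192.169.0.1", "100.63.255.255",
           "100.64.0.1", "100.127.255.254", "100.128.0.1", "8.8.8.8"]

mismatches = [ip for ip in SAMPLES if regex_says_internal(ip) != cidr_says_internal(ip)]
print(mismatches)
```

An empty mismatch list means the regex and the CIDR definitions agree on every boundary sampled; extend the samples if you add ranges to the exclusion.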

The changes as a before/after ledger, because the order and the numbers are the lesson:

Dimension	Before	After	Mechanism
Storage format	Plain text	Parquet	Delivery option at enable time
Table layout	Unpartitioned	Hive partitions + projection	Table DDL
Per-query scan	~400 GB	~1 day’s few GB	Partition pruning
Athena spend	Four figures / mo	> 95% lower	Less data scanned
Compliance evidence	Manual, quarterly	Scheduled, daily, auditable	Scheduled Athena query to compliance bucket
Real destination visible	No (saw NAT IP)	Yes (`pkt_dstaddr`)	Custom format field
Finding → fix	Ad hoc	Same IaC repo + regression query	Closed loop

Advantages and disadvantages

Flow-log-based detection is powerful and cheap, but it has a hard ceiling — it is L3/L4 only. Weigh it honestly:

Advantages	Disadvantages
Telemetry you already have — no new agent, tap, or NDR appliance	Pure L3/L4 — no process, user, DNS, SNI, or payload
Azure Traffic Analytics enriches (geo, classification, rule, threat-intel) for you	On AWS you build every enrichment yourself
Sees encrypted traffic’s metadata — volume, destination, cadence	Cannot see what was in the encrypted payload
Cheap at scale with Parquet + partition projection	Naive storage (text, unpartitioned) makes queries ruinously expensive
Perfect for exfil-volume, scan, beacon-cadence, and geo-egress detection	Sampled/aggregated — loses per-packet timing inside a window
The deciding rule (Azure `AclRule`) makes denied-flow analysis actionable	AWS gives no rule name — you infer from the tuple
Ideal breach-reconstruction and compliance-evidence artefact	REJECT only appears when a rule denies — gaps from upstream drops / asymmetry
Detections are portable KQL/SQL you own, not a vendor black box	Beaconing/exfil detection needs 1-min interval and custom fields — more volume

Flow logs are the right tool for network-metadata detection: exfil volume, scan patterns, beacon cadence, geo-egress compliance, and east-west segmentation verification. They are the wrong tool for payload inspection (use a firewall with TLS inspection or an NDR), application-layer context (correlate with app logs), or identity (correlate with the IdP). The disadvantages are manageable: the payload blind spot is filled by an egress firewall, the AWS enrichment gap by lookup joins, the cost trap by Parquet and projection. The one thing you cannot fix is aggregation: each record summarises a window of packets, so when you need per-packet truth you reach for “When Logs Aren’t Enough: Packet Capture, Traffic Mirroring, and Deep Network Troubleshooting”.

Hands-on lab

Build a minimal but complete pipeline on AWS — enable a custom-format VPC flow log to S3, create the Athena table with partition projection, generate a deliberate denied flow, and confirm it surfaces in a detection query. Everything here is low-cost (a NAT-less test VPC, a few MB of logs, a handful of Athena queries scanning a single day). Run in a sandbox account.

Step 1 — Variables and a test VPC.

REGION=us-east-1
BUCKET=netflow-lab-$RANDOM
VPC=$(aws ec2 create-vpc --cidr-block 10.9.0.0/16 \
        --query 'Vpc.VpcId' --output text --region $REGION)
aws s3 mb s3://$BUCKET --region $REGION
echo "VPC=$VPC BUCKET=$BUCKET"

Expected: a VPC ID like vpc-0abc… and a created bucket.

Step 2 — Grant flow logs permission to write to S3. S3-destination flow logs use a bucket policy (not an IAM role); the delivery service writes directly. Add the policy:

cat > /tmp/flowlog-bucket-policy.json <<POLICY
{ "Version": "2012-10-17", "Statement": [
  { "Sid": "AWSLogDeliveryWrite", "Effect": "Allow",
    "Principal": { "Service": "delivery.logs.amazonaws.com" },
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::$BUCKET/AWSLogs/*" },
  { "Sid": "AWSLogDeliveryCheck", "Effect": "Allow",
    "Principal": { "Service": "delivery.logs.amazonaws.com" },
    "Action": ["s3:GetBucketAcl","s3:ListBucket"],
    "Resource": "arn:aws:s3:::$BUCKET" } ]
}
POLICY
aws s3api put-bucket-policy --bucket $BUCKET --policy file:///tmp/flowlog-bucket-policy.json

Step 3 — Create the custom-format flow log at 1-minute interval.

aws ec2 create-flow-logs \
  --resource-type VPC --resource-ids $VPC \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::$BUCKET/ \
  --max-aggregation-interval 60 \
  --destination-options FileFormat=parquet,HiveCompatiblePartitions=true,PerHourPartition=true \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction} ${pkt-srcaddr} ${pkt-dstaddr} ${tcp-flags} ${traffic-path}' \
  --region $REGION

Expected: {"Unsuccessful": []} and a FlowLogId. Confirm it is active:

aws ec2 describe-flow-logs --region $REGION \
  --filter "Name=resource-id,Values=$VPC" \
  --query 'FlowLogs[].{Status:FlowLogStatus,Dest:LogDestination,Fmt:LogFormat}'

Expected: Status: ACTIVE.

Step 4 — Generate traffic, including a deliberate denied flow. Launch a tiny instance (or use an existing one) and make an allowed egress call plus a call to a port the security group blocks, so you get both an ACCEPT and a REJECT. Wait ~10–15 minutes for the first records to land (flow logs have a delivery delay). Confirm objects arrived:

aws s3 ls s3://$BUCKET/AWSLogs/ --recursive --region $REGION | head

Expected: Parquet objects under .../year=2026/month=06/day=08/hour=…/.

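Step 5 — Create the Athena table with partition projection. In the Athena console (or via aws athena start-query-execution), run the CREATE EXTERNAL TABLE from the query-layer section, adjusting LOCATION, projection.region.values, storage.location.template and the bucket name to match. The template has to mirror the keys listed in step 4: Hive-compatible delivery writes key=value segments (aws-account-id=…/aws-service=vpcflowlogs/aws-region=…/year=…/month=…/day=…/hour=…), so a template written for a plain region/yyyy/MM/dd layout returns zero rows against them. Then a scoped query: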

SELECT action, COUNT(*) AS flows, SUM(bytes) AS bytes
FROM vpc_flow_logs
WHERE region = 'us-east-1'
  AND day = date_format(current_date, '%Y/%m/%d')
GROUP BY action;

Expected: rows for ACCEPT and REJECT. If you see only ACCEPT, your test’s denied call did not hit a rule — re-check the security group.

Step 6 — Run a detection and confirm your denied flow surfaces. The denied-offenders query, scoped to today:

SELECT srcaddr, dstaddr, dstport, COUNT(*) AS rejects
FROM vpc_flow_logs
WHERE region = 'us-east-1'
  AND day = date_format(current_date, '%Y/%m/%d')
  AND action = 'REJECT'
GROUP BY srcaddr, dstaddr, dstport
ORDER BY rejects DESC;

Expected: your deliberate denied flow appears as a row. That proves the loop end to end: enable → deliver → partition → query → detect.

Validation checklist. You enabled a custom-format flow log (capturing pkt-dstaddr, flow-direction, tcp-flags), delivered Parquet with Hive partitions, built an Athena table with partition projection (scoped scans, no MSCK REPAIR), and confirmed a deliberate denied flow surfaces in a detection query scanning a single day. The steps mapped to what each proves:

Step	What you did	What it proves
3	Custom format, 1-min, Parquet	Collection captures detection-grade fields cheaply
4	Generated ACCEPT + REJECT	Both verdicts are recorded
5	Partition-projected table	Scoped scans work without a crawler
6	Denied-offenders query	The detection surfaces a real denied flow

Cleanup.

FLID=$(aws ec2 describe-flow-logs --region $REGION \
  --filter "Name=resource-id,Values=$VPC" --query 'FlowLogs[0].FlowLogId' --output text)
aws ec2 delete-flow-logs --flow-log-ids $FLID --region $REGION
aws s3 rb s3://$BUCKET --force --region $REGION
aws ec2 delete-vpc --vpc-id $VPC --region $REGION

Cost note. A few MB of flow logs, a handful of single-day Athena scans, and a short-lived micro instance cost well under ₹50 for the whole lab. Athena bills per TB scanned; because every query above is partition-scoped to one day, each scans a trivial amount. Delete the bucket to stop storage charges.

Common mistakes & troubleshooting

The failure modes that cost the most time, as a symptom → cause → confirm → fix table, then the detail on the ones that bite hardest.

#	Symptom	Root cause	Confirm (exact cmd / path)	Fix
1	`NTANetAnalytics` empty, but blobs exist in storage	Traffic Analytics disabled or wrong workspace	`az network watcher flow-log show --query flowAnalyticsConfiguration`	Enable TA; point to the correct workspace
2	Cannot create a new NSG flow log; existing ones face a hard stop	NSG flow logs are being retired (no new ones since 30 Jun 2025; full retirement 30 Sep 2027)	`az network watcher flow-log list` shows NSG-scoped logs	Migrate to VNet flow logs (see deprecation section)
3	Athena query scans 400 GB, costs a fortune	Unpartitioned and/or plain-text table	Athena “Data scanned” per query	Parquet + Hive partitions + partition projection
4	Athena returns zero rows despite objects in S3	`storage.location.template` / `LOCATION` wrong	Compare template to actual S3 path	Fix template/date format to match paths exactly
5	Exfil detection sees NAT IP, resolves wrong country	Default format omits `pkt-dstaddr`	`describe-flow-logs --query 'FlowLogs[].LogFormat'`	Recreate with custom format incl. `pkt-dstaddr`
6	Port-scan detection misses SYN scans	Default format omits `tcp-flags`	Same as #5	Add `tcp-flags` to the custom format
7	Beaconing detection finds nothing	10-min interval hides cadence	`describe-flow-logs --query 'FlowLogs[].MaxAggregationInterval'` = 600	Recreate at `--max-aggregation-interval 60`
8	Alerts fire constantly on batch traffic	Alerting on raw flow counts	Review the alert rule’s query	Gate on baseline deviation + volume floor
9	No REJECT records for known-blocked traffic	Traffic dropped upstream / asymmetric route	Check if the drop is at a firewall or other unlogged layer, not the SG/NACL	Log at the layer that drops; enable the firewall’s own logs
10	Log Analytics bill spikes after enabling TA	Fine TA interval + long workspace retention	Workspace usage by table (`NTANetAnalytics`)	Coarsen interval; set retention deliberately
11	Partitions look stale / missing recent days	Projection `range` ends before today	Check `projection.day.range`	Set range end to `NOW`
12	Custom format order mismatched at parse	Athena column order ≠ `log-format` order	Diff DDL columns vs `--log-format`	Align column order exactly to the format string
13	Flow log created but no data ever arrives	S3 bucket policy missing delivery permission	`aws s3api get-bucket-policy`	Add `delivery.logs.amazonaws.com` write policy
14	`MaliciousFlow` never appears	Expecting it on non-`ExternalPublic` flows	Query `FlowType` distribution	Malicious match applies to public endpoints only

#3 — the unpartitioned-scan cost trap. By far the most expensive mistake. An Athena table over plain-text logs with no partitions reads every byte of history to answer a one-day question. Confirm: the “Data scanned” figure Athena prints after each query — if it is hundreds of GB for a recent-window query, you are scanning history. Fix: recreate flow logs as Parquet with HiveCompatiblePartitions=true, rebuild the table with partition projection, and filter every query on day. This is the change that took Meridian Pay’s bill down 95%.
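
To put rough numbers on that gap, here is a small worked calculation in Python. The $5 per TB on-demand price and the 4 GB single-day scan are assumptions for illustration only; check the Athena pricing page for your region and read the real figure from “Data scanned”.

```python
# Assumption: on-demand Athena price in USD per TB scanned. Verify for your region.
PRICE_PER_TB_USD = 5.0
GB_PER_TB = 1024  # approximation; good enough for order-of-magnitude sizing


def scan_cost_usd(gb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Approximate cost of one query that scans `gb_scanned` gigabytes."""
    return gb_scanned / GB_PER_TB * price_per_tb


def saving_pct(before_gb, after_gb):
    """Percentage reduction in bytes scanned (and therefore in scan cost)."""
    return (1 - after_gb / before_gb) * 100


unpartitioned = scan_cost_usd(400)  # whole-history scan, as in row #3
partitioned = scan_cost_usd(4)      # hypothetical single-day Parquet partition
print(f"unpartitioned: ${unpartitioned:.2f} per query")
print(f"partitioned:   ${partitioned:.4f} per query")
print(f"saving:        {saving_pct(400, 4):.0f}%")
```

The per-query figure looks small in isolation; a scheduled detection that runs every few minutes multiplies it by hundreds of executions a day, which is where the unpartitioned table becomes expensive.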

#5 — the NAT-IP blind spot. The default format records only srcaddr and dstaddr, and for traffic arriving on a network interface dstaddr is always that interface’s own private address. On a NAT gateway’s ENI, egress from your instances therefore shows the NAT’s private IP as the destination, while the real internet host is recorded only in pkt-dstaddr. Your geo lookup then resolves the NAT’s location, not the real destination’s, and the exfil hides. Confirm: your flow-log LogFormat does not contain pkt-dstaddr. Fix: recreate with a custom format including pkt-srcaddr/pkt-dstaddr, and use pkt_dstaddr in every egress detection.

#9 — the missing-REJECT gap. A REJECT record only appears when a security group or NACL actually denies the flow; VPC flow logs capture both. If traffic is dropped by an upstream firewall (Azure Firewall, AWS Network Firewall), at another layer you are not logging, or the return path is asymmetric, you may get no flow record at all, and absence of a denied flow is not proof the traffic was allowed. Confirm: trace where the drop happens; check whether the deny is at a layer you are logging. Fix: enable flow logs (and/or firewall logs) at the layer that actually makes the drop decision. NACL-level denies already appear in VPC flow logs as REJECT, but firewall drops need the firewall’s own logs.

Best practices

Collect at the VNet/VPC scope, not per-NSG/per-ENI. VNet and VPC flow logs follow the resource, survive NSG reassignment (Azure), and automatically cover new ENIs (AWS). Per-NSG logs are the deprecated path.
Use a custom AWS format with pkt-dstaddr, flow-direction, and tcp-flags. These three fields are the difference between a pipeline that can detect exfil, scans, and beacons and one that cannot. The default format silently drops them.
Set a 1-minute aggregation interval for security. Beaconing and scan-cadence detections are impossible at 10 minutes. Pay for the extra volume (up to 10× the records) to get the fidelity.
Store AWS logs as Parquet with Hive-compatible partitions, and query with partition projection. This single discipline keeps Athena cost sane and eliminates MSCK REPAIR/crawlers.
Enable Traffic Analytics on Azure — don’t leave raw tuples in blob. The geo, classification, rule-resolution, and threat-intel enrichment is the whole reason to use flow logs for detection on Azure.
Migrate off NSG flow logs now. New NSG flow logs cannot be created since 30 June 2025, and the feature retires on 30 September 2027; anything still built on them has a fixed expiry date. Move to VNet flow logs and NTANetAnalytics.
Gate every alert on baseline deviation plus a volume floor. Never alert on raw flow counts. Require both “N× the 7-day baseline” and “> an absolute floor” so batch traffic never pages you.
Alert on new external destinations and denied-flow rate per rule, not on volume to known endpoints. New-destination and rate-based signals are high-value and low-noise; volume-to-known-CDN is pure noise.
Keep raw retention short and set the query-layer retention deliberately. Raw storage/blob retention should be days; the cost lives in Log Analytics ingestion (Azure) and total S3 volume (AWS). Tier and expire aggressively.
Build the three-panel dashboard: top talkers, denied-by-rule, external-egress-by-country. Those three answer the overwhelming majority of investigations.
Feed every finding back into an NSG/security-group change in the same IaC repo. A detection that doesn’t change a rule is a log line. Use the detection query as the regression test after the change.
Maintain the RFC1918 + CGNAT exclusion regex centrally. Your “external egress” detections are only trustworthy if internal and carrier-grade-NAT ranges are excluded consistently.
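
One way to keep that exclusion consistent is to define it once as data rather than copy a regex between queries. The sketch below uses Python's standard `ipaddress` module instead of a regex (a swapped-in technique, not the guide's own implementation); the names `INTERNAL_NETS` and `is_external` are hypothetical, and the check covers IPv4 only.

```python
import ipaddress

# RFC1918 private ranges plus the carrier-grade NAT range (100.64.0.0/10).
INTERNAL_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def is_external(addr):
    """True if `addr` falls outside every internal range.

    IPv6 addresses never match an IPv4 network, so they are reported as
    external here; extend INTERNAL_NETS if you run dual-stack.
    """
    ip = ipaddress.ip_address(addr)
    return not any(ip in net for net in INTERNAL_NETS)


for candidate in ("10.1.2.3", "172.31.255.255", "100.64.0.1", "8.8.8.8"):
    print(candidate, "external" if is_external(candidate) else "internal")
```

Comparing against CIDR objects avoids the off-by-one mistakes that hand-written regexes for 172.16/12 and 100.64/10 tend to pick up.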

Security notes

Least-privilege access to the flow-log data. The flow-log storage account and S3 bucket, and the Log Analytics workspace, contain a map of your entire internal network — who talks to whom, on which ports. Restrict read access to SecOps; the data is sensitive intelligence for an attacker who gains it.
Protect the pipeline’s own identity. Traffic Analytics writes with the Network Watcher service; on AWS the S3 delivery uses the log-delivery service principal via a bucket policy — scope that policy to the exact bucket ARN and prefix, never *.
Immutability for forensic-grade logs. For breach reconstruction, enable object lock / immutable blob on the flow-log store so an attacker who gains write access cannot delete the record of their own activity. Retention for forensics should outlive your incident-response window.
Encrypt at rest and in transit. Use CMK/SSE-KMS on the S3 bucket and storage account; the flow-log content is network intelligence and warrants customer-managed keys where policy requires.
Don’t let the query layer become an exfil path. Athena and Log Analytics can export results; monitor and restrict who can run large exports of the flow-log dataset — the data itself is a reconnaissance goldmine.
Correlate, don’t rely solely on, network signals. Flow-log detections are strongest when joined with identity and endpoint telemetry. A MaliciousFlow plus a suspicious sign-in plus an EDR alert on the same host is a confirmed incident; any one alone is a lead. This is the principle from Zero Trust Architecture Blueprint: Identity, Network, and Data Pillars, applied to detection.
Threat-intel freshness. Azure’s MaliciousFlow uses Microsoft’s feed; on AWS you own the feed — keep your intel list current or the enrichment goes stale and misses new C2 infrastructure.

Cost & sizing

The bill drivers differ sharply by platform and destination:

Cost driver	Platform	What you pay for	Control
Log Analytics ingestion	Azure	Per-GB of enriched rows into `NTANetAnalytics`	Coarsen TA interval; deliberate retention
Raw blob storage	Azure	Cheap tuple storage	Short retention (7 days)
Traffic Analytics processing	Azure	Bundled with ingestion	Interval choice
S3 storage	AWS	Per-GB of Parquet objects	Lifecycle tiering + expiry
Athena scans	AWS	Per-TB scanned	Parquet + partition projection
CloudWatch Logs	AWS	Per-GB ingest + storage (dear)	Only on critical subnets
Amazon Data Firehose	AWS	Per-GB streamed + destination	Only where real-time is needed
NAT/egress data (indirect)	Both	Bytes the logs describe	Not a log cost, but correlated

The key insight: on Azure the cost is Log Analytics ingestion, driven by the Traffic Analytics interval and workspace retention — coarsen the interval and set retention deliberately. On AWS the cost is split between S3 storage (controlled by lifecycle tiering) and Athena scans (controlled entirely by Parquet + partitioning — a well-partitioned lake is nearly free to query).

The AWS lifecycle policy that tiers and expires raw logs — hot for 30 days, cheap archive to 1 year:

{
  "Rules": [{
    "ID": "flowlog-tiering",
    "Filter": { "Prefix": "vpc/" },
    "Status": "Enabled",
    "Transitions": [{ "Days": 30, "StorageClass": "GLACIER_IR" }],
    "Expiration": { "Days": 365 }
  }]
}

Retention tiering, the trade-off:

Tier / window	AWS	Azure	Cost	Query speed	Use for
Hot (0–30 d)	S3 Standard	Log Analytics (hot)	Highest	Fast	Active detection
Warm (30–90 d)	S3 Glacier IR	LA (long-term)	Lower	Slower	Investigation
Cold / archive (90 d–years)	S3 Glacier / Deep Archive	LA archive tier	Cheapest	Restore delay	Compliance/forensics
Expiry	Lifecycle `Expiration`	Retention policy	—	—	Cost control

Rough monthly picture (order-of-magnitude): a busy VPC at 1-minute intervals can emit tens of GB/day of raw logs; as Parquet in S3 with 30-day hot + Glacier tiering that is a modest storage line, and Athena scans stay low because every query is partition-scoped. On Azure, Traffic Analytics ingestion into Log Analytics dominates — a chatty VNet pushes meaningful GB/day of enriched rows, so a 60-minute interval and 30–90 day retention are the levers when the workspace bill climbs. There is no production-scale free tier for either, but the cost of not having it — an undetected exfil, a failed audit, an unreconstructable breach — dwarfs the pipeline’s few-thousand-rupees-a-month footprint.

Interview & exam questions

1. What is a flow log fundamentally, and what does it not contain? A flow log is a per-connection record built on the 5-tuple (src/dst IP, src/dst port, protocol) plus a verdict (allowed/denied) and byte/packet counters over an aggregation window. It contains no process, user, DNS name, TLS SNI, or payload — it is pure L3/L4 metadata. That blind spot is why flow-log detection is paired with firewall/NDR (payload) and identity/endpoint telemetry (context).

2. Why does the default AWS VPC Flow Log format need customizing for security? The default (version 2) omits tcp-flags (needed to distinguish a SYN scan from established traffic), flow-direction (needed for egress-only exfil/beacon detection), and pkt-srcaddr/pkt-dstaddr (the real source/destination behind a NAT gateway or load balancer). A custom format adding these three is the difference between a pipeline that can detect scans and exfil and one that cannot.

3. Explain the NSG-flow-logs deprecation and what replaces it. Microsoft blocked creation of new NSG flow logs from 30 June 2025 and will retire the feature on 30 September 2027; existing NSG flow logs keep working only until that date. The replacement is VNet flow logs, which attach to a VNet (not an NSG), see all in-scope traffic regardless of NSG assignment, survive NSG reassignment, add flow-state and encryption fields, and, with Traffic Analytics, write to the NTANetAnalytics table (the older AzureNetworkAnalytics_CL is retiring too). Any pipeline still built on NSG flow logs has a fixed expiry date.

4. What does Azure Traffic Analytics add over raw VNet flow logs? It enriches: geo-IP (Country), flow classification (FlowType: intra/inter-VNet, S2S, ExternalPublic, MaliciousFlow), the deciding NSG rule (AclGroup/AclRule), VM/subnet resolution, and a match against Microsoft’s known-malicious-IP feed. On AWS you build each of these yourself with lookup-table joins and regex classification. Traffic Analytics writes the enriched rows to NTANetAnalytics, which is where you run KQL detections.

5. How do you keep Athena flow-log queries affordable? Store logs as Parquet (columnar + compressed) with Hive-compatible partitions, create the table with partition projection so Athena computes partition locations from the query predicate (no MSCK REPAIR, no crawler), and filter every query on day/region so it scans one partition instead of the whole bucket. This routinely drops per-query scan from hundreds of GB to a few, cutting cost by 90%+.
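
As a conceptual illustration of why projection needs no crawler, the sketch below resolves a location template for one (region, day) pair and checks it against real object paths. It is not Athena's implementation; the bucket name, the template string and both function names are hypothetical, and the date format here uses Python strftime codes whereas Athena's projection format property takes Java date patterns.

```python
from datetime import date
from string import Template

# Hypothetical template and bucket; substitute the value from your own table DDL.
LOCATION_TEMPLATE = "s3://example-flow-logs/vpc/${region}/${day}/"


def projected_location(template, region, day, day_format="%Y/%m/%d"):
    """Resolve a location template for a single (region, day) partition."""
    return Template(template).substitute(
        region=region, day=day.strftime(day_format)
    )


def matches_objects(template, region, day, object_uris):
    """True if at least one real object URI sits under the projected prefix."""
    prefix = projected_location(template, region, day)
    return any(uri.startswith(prefix) for uri in object_uris)


print(projected_location(LOCATION_TEMPLATE, "us-east-1", date(2025, 1, 5)))
```

If `matches_objects` is false for a day you know has data, the template or date format does not match the real paths, which is the zero-rows symptom from the troubleshooting table.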

6. What does a data-exfiltration flow look like, and how do you detect it? Sustained, asymmetric outbound volume to one external destination — far more bytes leaving than arriving. Detect by computing the out/in byte ratio per (source, external-destination) pair and alerting when both the ratio (e.g. > 20:1) and an absolute floor (e.g. > 50 MB out) are exceeded. On Azure filter to FlowType == "ExternalPublic" and use BytesSrcToDest/BytesDestToSrc; on AWS use flow-direction = egress and pkt_dstaddr.
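
The ratio-plus-floor logic can be sketched outside the query engine as follows. This is a minimal Python illustration over in-memory tuples, not the guide's KQL or Athena query; the tuple layout and the function name `exfil_candidates` are assumptions, and the thresholds are the example values quoted in the answer.

```python
from collections import defaultdict

RATIO_THRESHOLD = 20.0            # out/in byte ratio, example value
FLOOR_BYTES = 50 * 1024 * 1024    # 50 MB outbound floor, example value


def exfil_candidates(flows, ratio_threshold=RATIO_THRESHOLD, floor_bytes=FLOOR_BYTES):
    """Return (src, dst, bytes_out, bytes_in) for pairs that breach both gates.

    `flows` is an iterable of (src, external_dst, bytes_out, bytes_in) tuples,
    already filtered to egress toward public destinations.
    """
    totals = defaultdict(lambda: [0, 0])
    for src, dst, bytes_out, bytes_in in flows:
        totals[(src, dst)][0] += bytes_out
        totals[(src, dst)][1] += bytes_in

    hits = []
    for (src, dst), (bytes_out, bytes_in) in totals.items():
        ratio = bytes_out / max(bytes_in, 1)  # avoid division by zero
        if ratio > ratio_threshold and bytes_out > floor_bytes:
            hits.append((src, dst, bytes_out, bytes_in))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Requiring both conditions is the point: the ratio alone flags tiny one-way flows, and the floor alone flags every large but balanced transfer.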

7. How do you distinguish a port scan from a host sweep in flow logs? A port scan is one source hitting many destination ports on one (or few) hosts — detect with dcount(DestPort)/COUNT(DISTINCT dstport) over a threshold, refined by SYN-only tcp-flags and a high denied ratio. A host sweep is one source hitting many hosts on one port (looking for a service) — detect with dcount(DestIp)/COUNT(DISTINCT dstaddr) for a specific service port. Both map to MITRE T1046.
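
The two distinct-count shapes can be shown side by side in a short Python sketch. The thresholds, the tuple layout and the function name `classify_scanners` are hypothetical, and the SYN-only and denied-ratio refinements mentioned above are left out for brevity.

```python
from collections import defaultdict


def classify_scanners(flows, port_threshold=100, host_threshold=50):
    """Label sources as port scanners or host sweepers.

    `flows` is an iterable of (src, dst, dstport) tuples.
    Returns a list of (kind, src, target, distinct_count) findings.
    """
    ports_per_target = defaultdict(set)  # (src, dst) -> distinct ports
    hosts_per_port = defaultdict(set)    # (src, dstport) -> distinct hosts
    for src, dst, dstport in flows:
        ports_per_target[(src, dst)].add(dstport)
        hosts_per_port[(src, dstport)].add(dst)

    findings = []
    for (src, dst), ports in ports_per_target.items():
        if len(ports) >= port_threshold:
            findings.append(("port_scan", src, dst, len(ports)))
    for (src, dstport), hosts in hosts_per_port.items():
        if len(hosts) >= host_threshold:
            findings.append(("host_sweep", src, dstport, len(hosts)))
    return findings
```

The grouping key is the only difference: group by (source, destination host) and count ports for a scan, group by (source, port) and count hosts for a sweep.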

8. Flow logs cannot see payload — how do you detect C2 beaconing? By cadence, not content. A beacon is a small flow to the same destination at a near-constant interval. Order the flows per (source, destination) pair, compute inter-arrival gaps, and flag pairs where the standard deviation of the gap is small relative to the mean (low jitter) over many flows. This needs a 1-minute aggregation interval — a 10-minute window hides the cadence.
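
A minimal Python sketch of the low-jitter test follows, using the coefficient of variation (standard deviation of the gaps divided by their mean). The 0.1 cut-off, the 10-flow minimum and the function names are assumptions for illustration; tune them against your own traffic.

```python
from statistics import mean, pstdev


def beacon_score(timestamps, min_flows=10):
    """Coefficient of variation of inter-arrival gaps, or None if undecidable.

    `timestamps` are flow start times in seconds for one (source, destination) pair.
    """
    ordered = sorted(timestamps)
    if len(ordered) < min_flows:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return None
    return pstdev(gaps) / average_gap


def is_beacon(timestamps, max_cv=0.1, min_flows=10):
    """True when the pair has enough flows and the gaps are nearly constant."""
    score = beacon_score(timestamps, min_flows)
    return score is not None and score <= max_cv
```

A score near zero means the gaps barely vary. Real beacons often add deliberate jitter, so a stricter cut-off trades missed detections for fewer false positives.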

9. Why might known-blocked traffic produce no REJECT record? A REJECT/Denied record only appears when a security group or NACL rule actually denies the flow. If the drop happens at an upstream firewall (Azure Firewall, AWS Network Firewall), at a layer you are not logging, or the return path is asymmetric, you may get no flow record at all. Absence of a denied flow is not proof the traffic was allowed — log at the layer that makes the drop decision.

10. How do you stop a flow-log detection pipeline from generating alert fatigue? Never alert on raw flow counts. Gate every alert on baseline deviation plus a volume floor (e.g. > 10× the 7-day baseline and > 100 MB), alert on new external destinations rather than volume to known ones, alert on denied-flow rate per rule rather than per flow, make geo/exfil detections stateful (suppress once triaged), and bin into 5–10 minute windows to smooth self-similar noise.
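
The two-condition gate is simple enough to express directly. The Python sketch below assumes a per-pair list of daily byte totals for the previous seven days; the function names are hypothetical and the multiplier and floor are the example values from the answer.

```python
from statistics import mean


def seven_day_baseline(daily_bytes):
    """Mean of the last seven daily totals (0 if there is no history)."""
    recent = list(daily_bytes)[-7:]
    return mean(recent) if recent else 0


def should_alert(current_bytes, baseline_bytes,
                 multiplier=10.0, floor_bytes=100 * 1024 * 1024):
    """Alert only when BOTH the baseline deviation and the volume floor are exceeded."""
    exceeds_baseline = current_bytes > multiplier * baseline_bytes
    exceeds_floor = current_bytes > floor_bytes
    return exceeds_baseline and exceeds_floor
```

A pair with no history has a baseline of zero, so any traffic clears the deviation test; the floor is what keeps small new flows from paging anyone.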

11. What is the value of pkt_dstaddr specifically, with an example? It records the real destination behind a NAT gateway/load balancer, versus dstaddr which shows the intermediary’s address. In the scenario, a staging batch job exfiltrated to a package mirror in an unapproved region through a NAT gateway; dstaddr showed the NAT’s IP (resolving to the NAT’s approved region), while pkt_dstaddr revealed the actual mirror and its unapproved country — without it, the finding would have been invisible.

12. How do you close the loop from a flow-log finding to a control change? Every finding feeds a security-group/NSG rule change in the same IaC repo as the flow-log config: a denied-flow spike on legitimate traffic means widen the rule; an unexpected ACCEPT to an admin port means add an explicit deny; a new benign external destination gets allow-listed, a malicious one gets blocked. Then re-run the same detection query as the regression test — the flow that used to ACCEPT should now show REJECT (or vice versa).

These map to SC-200 (Security Operations Analyst) — KQL hunting, Microsoft Sentinel/Log Analytics, Traffic Analytics; AZ-700 (Network Engineer) — Network Watcher, flow logs, monitoring; and the AWS Certified Security – Specialty — VPC Flow Logs, Athena/CloudWatch Logs analysis, detection. A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Flow-log schema / missing fields	AZ-700 / AWS Security	Monitor & troubleshoot networks
Traffic Analytics enrichment / `NTANetAnalytics`	SC-200 / AZ-700	Configure & analyze network security
KQL detections / hunting	SC-200	Threat detection with KQL
Athena partitioning / cost	AWS Security	Logging & monitoring at scale
Exfil / scan / beacon detection	SC-200 / AWS Security	Detection & investigation
NSG→VNet flow-log migration	AZ-700	Design & implement network monitoring

Quick check

Name the three fields the AWS default VPC Flow Log format omits that a detection pipeline needs, and one detection each enables.
What is the retirement timeline for NSG flow logs, and what table do you query for enriched Azure flow data now?
Your Athena query scans 400 GB to answer a one-day question. What two storage/table changes fix this, and roughly how much do they cut cost?
Flow logs cannot see the encrypted payload of a C2 channel. What flow-log property lets you detect the beacon anyway, and what aggregation interval is required?
You have a “blocked” port but see no REJECT records for traffic to it. Give one reason this is expected and not necessarily a gap.

Answers

tcp-flags (enables SYN-scan vs established distinction — port-scan detection), flow-direction (enables egress-only exfil/beacon detection), and pkt-srcaddr/pkt-dstaddr (reveals the real destination behind a NAT — geo-egress/exfil detection). Add all three via a custom --log-format.
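Creation of new NSG flow logs was blocked on 30 June 2025, and the feature retires on 30 September 2027; existing NSG flow logs work only until then. Query the NTANetAnalytics table (Traffic Analytics on VNet flow logs); the older AzureNetworkAnalytics_CL is retiring on the same track.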
Store as Parquet and rebuild the table with Hive-compatible partitions + partition projection, then filter on day. The query scans one day’s few GB instead of all history, cutting per-query cost by over 90% (Meridian Pay saw > 95%).
The cadence — a beacon is a regular, low-jitter interval to the same destination; compute the standard deviation of inter-arrival gaps and flag low-jitter pairs. This requires a 1-minute aggregation interval; a 10-minute window hides the cadence.
A REJECT only appears when a security group or NACL rule actually denies the flow. If the drop happens at an upstream firewall, at a layer you are not logging, or via an asymmetric return path, there is no flow record — absence is not proof of allow. Log at the layer that makes the drop decision.

Glossary

5-tuple — source IP, destination IP, source port, destination port, protocol; the identity of a network flow and the basis of every detection.
Flow log — a per-connection record of the 5-tuple plus verdict and byte/packet counters over an aggregation window; pure L3/L4, no payload.
NSG flow log (v2) — the legacy Azure flow log attached to a network security group; no new ones can be created since 30 June 2025, and the feature retires on 30 September 2027.
VNet flow log — the modern Azure flow log attached to a VNet/subnet/NIC; sees all in-scope traffic, survives NSG reassignment, adds flow-state and encryption fields.
Traffic Analytics — the Azure processing layer that enriches raw flow logs (geo, classification, deciding rule, threat-intel) and writes them to Log Analytics.
NTANetAnalytics — the modern Traffic Analytics table in Log Analytics; the surface for all Azure flow-log KQL detections.
VPC Flow Log — the AWS per-ENI/subnet/VPC/TGW flow record, delivered to S3, CloudWatch Logs, or Amazon Data Firehose.
Default vs custom format (AWS) — the default (v2) field subset vs a chosen field set; custom adds tcp-flags, flow-direction, pkt-srcaddr/pkt-dstaddr and more.
pkt-dstaddr / pkt-srcaddr — the real destination/source behind a NAT gateway or load balancer, versus the intermediary’s address.
flow-direction — AWS field marking a record as ingress or egress.
tcp-flags — a bitwise OR of TCP flags seen in the window (2 = SYN, 18 = SYN-ACK, 1 = FIN, 4 = RST); a lone 2 (a SYN that saw no reply in the window) is the signature of a half-open SYN scan.
Aggregation interval — the window a single record summarises (1 or 10 minutes on AWS); 1-minute is required for timing detections.
Partition projection — an Athena feature that computes partition S3 locations from the query predicate, avoiding MSCK REPAIR/crawlers and enabling scoped scans.
FlowType — Traffic Analytics classification: IntraVNet, InterVNet, S2S, ExternalPublic, MaliciousFlow, etc.
MaliciousFlow — a Traffic Analytics FlowType/column marking a flow matched against a known-malicious-IP feed; the highest-signal starting point.
C2 beaconing — a compromised host’s regular, low-jitter heartbeat to a controller; detectable by cadence in flow logs despite the encrypted payload.
Exfiltration (byte-ratio) — sustained asymmetric outbound volume (far more out than in) to an external destination; the exfil fingerprint at L3/L4.
Denied-flow spike — a surge in REJECT/Denied records against one rule, signalling scanning or a misconfigured client.
RFC1918 / CGNAT exclusion — the regex that keeps private (10/8, 172.16/12, 192.168/16) and carrier-grade-NAT (100.64/10) ranges out of “external egress” results so they show only public destinations.
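
Because the tcp-flags value defined above is a bitmask, decoding it is a handful of bitwise tests. The Python sketch below uses the standard TCP header bit values; PSH, ACK and URG come from the TCP header layout rather than from this guide, and the function names are hypothetical.

```python
# Standard TCP header flag bits. FIN, SYN and RST match the glossary entry;
# PSH, ACK and URG are the remaining standard bits, added here for completeness.
TCP_FLAG_BITS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}


def decode_tcp_flags(value):
    """Return the names of every flag bit set in an aggregated tcp-flags value."""
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]


def is_syn_only(value):
    """True when the window saw a SYN and nothing else (no SYN-ACK, FIN or RST)."""
    return value == 2


for sample in (2, 18, 3, 19):
    print(sample, decode_tcp_flags(sample), "syn-only" if is_syn_only(sample) else "")
```

Short connections can set several flags within one window, so values such as 3 or 19 are ordinary combined records rather than anomalies.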

Next steps

You can now turn flow logs into an active detection control on both clouds. Build outward:

Next: KQL Threat Hunting Playbooks: MITRE ATT&CK Mapping, UEBA, and Hunting Notebooks — fold these network hunts into a full ATT&CK-mapped hunting program.
Related: Micro-Segmentation with NSGs and Application Security Groups: Tier Isolation at Scale — the segmentation control your east-west detections verify.
Related: Centralized Internet Egress: FQDN Filtering, Explicit Proxy, and TLS Inspection — the egress chokepoint where you remediate exfil and unapproved-geo findings.
Related: AWS Network Firewall in Production: Suricata Rule Engineering for Egress Inspection — payload-layer inspection that fills the flow-log blind spot on AWS.
Related: When Logs Aren’t Enough: Packet Capture, Traffic Mirroring, and Deep Network Troubleshooting — when you need per-packet truth beyond aggregated flow logs.
Related: Engineering Incident Response: Runbooks, Tabletop Exercises, and Cloud Forensics — using flow logs as the primary artefact to reconstruct a breach.