Most Kubernetes “network observability” is a sidecar tax: an Envoy per pod, double the memory, and a service mesh you now have to operate just to see who talks to whom. Hubble takes the other road. Because Cilium already runs the datapath as eBPF programs attached to every pod’s virtual interface, the data plane already sees every packet — Hubble just asks the kernel to emit a structured event for each flow it processes. No sidecars, no app changes, and L7 visibility for HTTP/gRPC/DNS when you opt in per-workload. This is the working setup: the architecture, enabling flow and L7 visibility, reading flows to debug policy drops, exporting metrics to Prometheus, building the service map, shipping flows to long-term storage, and what it actually costs at scale.
1. The architecture: datapath, Relay, and the UI
Hubble has three layers, and confusing them is the source of most “why can’t I see flows” tickets.
            per node                               cluster-wide
+-------------------------------+
| cilium-agent (DaemonSet)      |
|  eBPF datapath -> ring buffer |
|  Hubble embedded server       |---gRPC :4244--+
|  (node-local flow API)        |               |
+-------------------------------+               v
+-------------------------------+       [ Hubble Relay ]    --:4245--> hubble CLI
| cilium-agent (other nodes)    |---:4244-> (aggregates all --:80 -----> Hubble UI
|  Hubble embedded server       |            node servers into
+-------------------------------+            one cluster view)
- eBPF datapath. Cilium’s programs run on the tc/XDP hooks of every endpoint. As they make forwarding and policy decisions, they push events (trace, drop, policy-verdict, plus parsed L7 records) into a per-CPU ring buffer. This is the only place packets are touched; everything above is read-only consumption.
- Hubble server (embedded in cilium-agent). Each agent reads its node’s ring buffer, converts raw events into the Flow protobuf, keeps the last N flows in an in-memory ring (default 4095 per node), and serves them on a node-local gRPC API. This is per node: it only knows about flows that traversed its own endpoints.
- Hubble Relay. A small Deployment that fans out to every node’s Hubble server and presents a single cluster-wide gRPC API on port 4245. The hubble CLI and the UI both talk to Relay, not to individual agents.
- Hubble UI. A web frontend that consumes Relay and renders the live service map plus a flow table.
The mental model that keeps you sane:
The agent’s in-memory ring buffer is the only state Hubble keeps, and it is tiny and lossy by design. Hubble is a live tap, not a database. Anything you want to keep (metrics, long-term flow logs) has to be exported to Prometheus or an external sink. Treat hubble observe as tcpdump, not as Splunk.
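To make "tiny and lossy by design" concrete, here is a minimal Python sketch of a fixed-capacity buffer. It is only an analogy for the behavior described above, not Hubble's implementation; the 4095 figure is the default quoted earlier, and the 10,000-flow burst is a made-up number.

```python
from collections import deque

# Assumption: 4095 mirrors the default per-node capacity quoted above.
CAPACITY = 4095


def make_flow_ring(capacity: int = CAPACITY) -> deque:
    """Return a fixed-size buffer that silently evicts the oldest entry when full."""
    return deque(maxlen=capacity)


ring = make_flow_ring()
for seq in range(10_000):            # pretend 10,000 flows crossed this node
    ring.append({"seq": seq})        # once full, each append evicts the oldest flow

oldest, newest = ring[0]["seq"], ring[-1]["seq"]
print(len(ring), oldest, newest)     # 4095 5905 9999
```

The first 5,905 flows are gone with no error and no backpressure, which is why anything you need later has to leave the node before it is overwritten.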
2. Enable Hubble, Relay, and the UI
This assumes Cilium is already your CNI. Enable the three pieces with the Cilium CLI:
# Turn on the embedded server, Relay, and UI
cilium hubble enable --ui
# Wait for everything to settle, then sanity-check
cilium status --wait
Equivalent Helm values if you manage Cilium with Helm/GitOps (the correct path for production):
# values.yaml for the cilium/cilium chart
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
  # mTLS between Relay and node servers; auto-issues certs
  tls:
    auto:
      enabled: true
      method: helm
helm upgrade cilium cilium/cilium --namespace kube-system \
--reuse-values -f values.yaml
kubectl -n kube-system rollout status ds/cilium
Install the CLI and point it at Relay via a port-forward (the CLI runs the forward for you):
# One-time CLI install (Linux amd64 shown)
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --fail -o hubble.tar.gz \
"https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
sudo tar xzf hubble.tar.gz -C /usr/local/bin hubble
# Forward Relay (defaults to localhost:4245) and confirm node count
cilium hubble port-forward &
hubble status
hubble status should report every node as connected. If a node shows up unavailable, its agent’s embedded server is unhealthy — check that node’s cilium-agent, not Relay.
3. Turn on flow visibility and L7 protocol parsing
Out of the box you get L3/L4 flows for free — every pod-to-pod connection, with verdict (forwarded/dropped) and the Cilium identity on each side. No configuration needed; it is a property of the datapath.
L7 visibility (decoded HTTP methods, paths, status codes, gRPC services, DNS queries) is not free, because it requires Cilium to redirect matching traffic out of the eBPF fast path and through its node-local userspace L7 proxy. You enable it per workload by attaching an L7 rule in a CiliumNetworkPolicy. The act of specifying an L7 rule is what switches on parsing for that traffic: you are not just filtering, you are asking Cilium to look inside.
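DNS visibility for a namespace (this also lets you write FQDN-based policy later). Be aware that an egress rule puts the selected pods into default-deny for egress, so any other egress they need must be allowed by a policy as well: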
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: dns-visibility
  namespace: shop
spec:
  endpointSelector: {} # all pods in the namespace
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*" # parse every DNS query, allow all
HTTP visibility on a service’s ingress. Here we still allow every HTTP request on port 8080, we just want it parsed (ingress on any other port now needs its own allow rule):
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: api-http-visibility
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - {} # empty rule = match (and parse) all HTTP
Now watch the decoded L7 stream:
hubble observe --namespace shop --protocol http --follow
Sample output — note the method, path, status, and latency are all present without touching the app:
Jun 8 19:42:11.004: shop/web-7c-abc:51324 -> shop/api-7d-xyz:8080 http-request FORWARDED (HTTP/1.1 GET /v1/cart)
Jun 8 19:42:11.009: shop/api-7d-xyz:8080 -> shop/web-7c-abc:51324 http-response FORWARDED (HTTP/1.1 200 GET /v1/cart) took 4.91ms
Cost-control rule: an empty L7 rule (http: [{}]) sends all of that port’s traffic through the proxy, which adds per-request CPU and a small latency hop. Scope L7 visibility to the services you are actively debugging or whose golden signals you genuinely need. Do not blanket-enable HTTP parsing cluster-wide and then wonder why p99 moved.
4. Read flows to debug drops and policy denials
This is where Hubble earns its keep. The classic failure — “service A can’t reach service B and nobody knows why” — is a two-minute investigation instead of an afternoon of tcpdump and guesswork.
Filter for everything Cilium dropped, across the namespace:
# Dropped flows in the namespace, streamed live
hubble observe --namespace shop --verdict DROPPED --follow
A policy denial is unambiguous in the output:
Jun 8 19:51:03.880: shop/web-7c-abc:44120 -> shop/payments-9a-def:9000 \
policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Jun 8 19:51:03.880: shop/web-7c-abc:44120 -> shop/payments-9a-def:9000 \
Policy denied DROPPED (TCP Flags: SYN)
policy-verdict:none means no policy allowed this — the egress from web to payments isn’t permitted. The fix is a policy that allows it, not a firewall change.
The flags that matter in real investigations:
| Need | Flag |
|---|---|
| Only dropped traffic | --verdict DROPPED |
| Why it dropped (reason) | --type drop then read --output json .drop_reason_desc |
| Scope to a pod | --pod shop/web-7c-abc |
| One direction | --from-pod ... / --to-pod ... |
| By DNS name (needs FQDN parsing) | --fqdn api.shop.svc.cluster.local |
| Identity-level view | --from-identity <id> / --label app=api |
| Last N from the ring (not live) | --last 200 |
| Machine-readable | -o json (or -o jsonpb) |
When the cause isn’t obvious, go to JSON and read the structured drop reason and the L4 details:
# Newer CLIs wrap each flow under .flow; (.flow // .) handles both shapes
hubble observe --namespace shop --type drop --last 50 -o json \
  | jq '(.flow // .)
        | {t:.time, src:.source.pod_name, dst:.destination.pod_name,
           dport:.l4.TCP.destination_port, reason:.drop_reason_desc}'
Common drop reasons you will actually see: Policy denied (no allow rule), Stale or unroutable IP (endpoint churn / stale conntrack), Unsupported L3 protocol, and CT: Map insertion failed (conntrack table pressure, which is a capacity signal, not a policy one). In JSON output, drop_reason_desc carries the enum form of the same reason, for example POLICY_DENIED.
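When the drop list is long, a tally per source, destination, and reason is usually more useful than scrolling. The sketch below is one way to do that in Python. It assumes each input line is one JSON object as printed by hubble observe -o json, either wrapped under a "flow" key or bare; the sample records and the helper name tally_drop_reasons are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def tally_drop_reasons(lines: Iterable[str]) -> Counter:
    """Count dropped flows per (source pod, destination pod, drop reason)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: flows may arrive wrapped as {"flow": {...}} or bare.
        flow = record.get("flow", record)
        reason = flow.get("drop_reason_desc")
        if not reason:
            continue                  # not a drop event
        src = flow.get("source", {}).get("pod_name", "?")
        dst = flow.get("destination", {}).get("pod_name", "?")
        counts[(src, dst, reason)] += 1
    return counts


# Hypothetical sample input standing in for real CLI output.
denied = {"source": {"pod_name": "web-7c-abc"},
          "destination": {"pod_name": "payments-9a-def"},
          "drop_reason_desc": "POLICY_DENIED"}
sample = [
    json.dumps({"flow": denied}),
    json.dumps({"flow": denied}),
    json.dumps({"source": {"pod_name": "api-7d-xyz"},
                "drop_reason_desc": "STALE_OR_UNROUTABLE_IP"}),
    json.dumps({"flow": {"verdict": "FORWARDED"}}),
]

counts = tally_drop_reasons(sample)
for (src, dst, reason), n in counts.most_common():
    print(f"{n:4d}  {reason:24s} {src} -> {dst}")
```

To run it against a live cluster, replace the sample list with sys.stdin and pipe the hubble observe command above into the script.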
5. Export Hubble metrics to Prometheus and Grafana
Flows are ephemeral; metrics are how you keep the signal. Hubble can aggregate flows into Prometheus metrics directly in the agent — this is independent of the flow ring buffer and is what you alert on.
Enable the metrics you want via Helm. Each metric name can take context options (e.g. labelsContext, sourceContext) that control label cardinality — choose carefully, this is your cardinality budget:
hubble:
  enabled: true
  metrics:
    # Exemplars are only exposed in the OpenMetrics format
    enableOpenMetrics: true
    # Each entry is a metric with its context/label options
    enabled:
      - "dns:query;ignoreAAAA"
      - "drop:sourceContext=identity;destinationContext=identity"
      - "tcp"
      - "flow:sourceContext=workload-name;destinationContext=workload-name"
      - "port-distribution"
      - "icmp"
      - "httpV2:exemplars=true;labelsContext=source_namespace,destination_namespace"
    # Expose the metrics endpoint and ship a ServiceMonitor
    serviceMonitor:
      enabled: true
    dashboards:
      enabled: true # provisions the official Hubble Grafana dashboards
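The cardinality budget mentioned above can be estimated before you roll the values out. This is a back-of-the-envelope sketch in Python: the worst-case series count for one metric is the product of the distinct values each label can take. The label names and the cluster sizes (40 workloads, 1,200 pods) are hypothetical, and real counts are lower because most source/destination pairs never talk.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cluster: 40 workloads, 3 verdicts, 4 flow types.
by_workload = worst_case_series(
    {"source": 40, "destination": 40, "verdict": 3, "type": 4}
)

# Same metric, but labeled per pod (hypothetical 1,200 pods).
by_pod = worst_case_series(
    {"source": 1200, "destination": 1200, "verdict": 3, "type": 4}
)

print(by_workload, by_pod)   # 19200 17280000
```

The gap between the two numbers is the argument for workload-level or identity-level contexts over pod-level ones.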
Apply and confirm the agent is exposing Hubble metrics (separate from Cilium’s own :9962):
helm upgrade cilium cilium/cilium -n kube-system --reuse-values -f values.yaml
kubectl -n kube-system rollout status ds/cilium
# Hubble metrics default to port 9965 on each agent
kubectl -n kube-system port-forward ds/cilium 9965:9965 &
curl -s localhost:9965/metrics | grep -E 'hubble_(drop|http|dns)_' | head
The metrics you actually build SLOs and alerts on:
| Metric | What it tells you |
|---|---|
| hubble_drop_total | dropped packets, labeled by reason/protocol/identity |
| hubble_flows_processed_total | total flows by verdict, type, subtype |
| hubble_http_requests_total | L7 request count by method/status (requires httpV2) |
| hubble_http_request_duration_seconds | HTTP latency histogram (gives you RED golden signals, with exemplars) |
| hubble_dns_queries_total / hubble_dns_responses_total | DNS query volume and rcode (catch NXDOMAIN storms) |
| hubble_tcp_flags_total | RST/SYN distribution, a connection-churn smell test |
A PromQL alert that flags policy-driven drops between workloads (the kind that silently breaks a feature):
# Sustained drops attributable to policy, by destination identity
sum by (destination, reason) (
rate(hubble_drop_total{reason="POLICY_DENIED"}[5m])
) > 0
Because httpV2 carries exemplars, a latency spike on the hubble_http_request_duration_seconds panel in Grafana can link straight to the trace, provided the requests carry trace context that Hubble can extract. It is the same exemplar workflow you use for app metrics, now for the network layer.
6. Build the live service map and find rogue dependencies
Open the UI:
cilium hubble ui # opens http://localhost:12000 via port-forward
Pick a namespace and the UI renders a live, eBPF-derived service map: every workload as a node, every observed flow as an edge, L7 edges labeled with HTTP path or DNS name. This is not a static diagram someone drew in a wiki two years ago and never updated — it is what is actually talking right now, reconstructed from kernel events.
What this catches that architecture docs never do:
- Unexpected egress. A pod reaching api.stripe.com or an internal service it has no business calling, visible immediately as an edge to a world or cross-namespace identity.
- Cross-namespace coupling you thought you’d severed, e.g. frontend still hitting legacy-monolith in another namespace.
- DNS to surprising destinations, often the first sign of a misconfigured client or exfiltration.
To do the same headless (for CI checks or audits), derive the dependency set from flows as JSON:
# Unique src-namespace/app -> dst-namespace/app edges among the last 5000 buffered flows
hubble observe --namespace shop --last 5000 -o json \
  | jq -r '(.flow // .)
      | [ .source.namespace,
          (first(.source.labels[]? | select(startswith("k8s:app="))) // "-"),
          "->",
          .destination.namespace,
          (first(.destination.labels[]? | select(startswith("k8s:app="))) // "-") ]
      | @tsv' \
  | sort -u
Diff that set against an allowlist in CI and you have drift detection for your actual network topology, not your intended one.
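A minimal sketch of that CI check in Python, under a few assumptions: flows arrive as one JSON object per line (wrapped under "flow" or bare), workloads are identified by a k8s:app= label, and the allowlist is a plain set of (source, destination) pairs. The helper names, sample flows, and allowlist are all hypothetical.

```python
import json
from typing import Iterable

Edge = tuple[str, str]


def app_of(endpoint: dict) -> str:
    """Return 'namespace/app' from the k8s:app= label, or 'namespace/-' if absent."""
    ns = endpoint.get("namespace") or "-"
    for label in endpoint.get("labels", []):
        if label.startswith("k8s:app="):
            return f"{ns}/{label.split('=', 1)[1]}"
    return f"{ns}/-"


def observed_edges(lines: Iterable[str]) -> set[Edge]:
    """Collapse a stream of flow records into a set of unique src -> dst edges."""
    edges: set[Edge] = set()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        flow = record.get("flow", record)   # assumption: wrapped or bare flow
        edges.add((app_of(flow.get("source", {})),
                   app_of(flow.get("destination", {}))))
    return edges


def drift(observed: set[Edge], allowed: set[Edge]) -> set[Edge]:
    """Edges seen on the wire that the allowlist does not cover."""
    return observed - allowed


def _flow(src_ns: str, src_app: str, dst_ns: str, dst_app: str) -> str:
    return json.dumps({"flow": {
        "source": {"namespace": src_ns, "labels": [f"k8s:app={src_app}"]},
        "destination": {"namespace": dst_ns, "labels": [f"k8s:app={dst_app}"]},
    }})


# Hypothetical flows and allowlist.
sample = [
    _flow("shop", "web", "shop", "api"),
    _flow("shop", "web", "shop", "api"),            # duplicate edge collapses
    _flow("shop", "frontend", "legacy", "legacy-monolith"),
]
allowlist = {("shop/web", "shop/api")}

unexpected = drift(observed_edges(sample), allowlist)
print(sorted(unexpected))   # [('shop/frontend', 'legacy/legacy-monolith')]
```

In a pipeline you would exit non-zero when unexpected is non-empty, so a new dependency has to be added to the allowlist deliberately.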
7. Export flows to an external sink for retention
The ring buffer holds minutes, maybe seconds under load. For incident forensics, compliance, or “what did this pod talk to last Tuesday,” you must export flows off-node. Cilium supports a built-in FlowLog export: the agent writes flows as JSON to a file (typically a host path you tail with your log shipper), with optional field redaction and filtering so you don’t drown in or leak data.
Configure export via Helm (flowlogs.config), which writes a flowlog.yaml the agent reads:
hubble:
  enabled: true
  export:
    dynamic:
      enabled: true
      config:
        enabled: true
        content:
          - name: "all-l7"
            filePath: "/var/run/cilium/hubble/events.log"
            fieldMask:
              - time
              - source.namespace
              - source.pod_name
              - destination.namespace
              - destination.pod_name
              - l4
              - l7
              - verdict
            includeFilters:
              - source_pod: ["shop/"] # only flows whose source pod is in the shop namespace
            excludeFilters: []
The agent now appends newline-delimited JSON to that path on each node. Ship it the normal way — a Fluent Bit / Vector / Promtail tail on the host path into Loki, OpenSearch, or S3 — and you have queryable, retained flow logs decoupled from Hubble’s tiny buffer. Keep the fieldMask tight: full L7 payloads are large and may contain sensitive headers, so export only the fields your investigations need.
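To see what a tight fieldMask buys you, here is a small Python sketch of the projection it performs. It assumes the mask behaves like a list of dotted paths into the flow object, where naming a parent keeps its whole subtree; the agent does this for you, so the function below (apply_field_mask, a made-up name) only illustrates the semantics on a hypothetical flow.

```python
from typing import Any


def apply_field_mask(flow: dict[str, Any], paths: list[str]) -> dict[str, Any]:
    """Keep only the dotted paths listed in `paths`; drop everything else."""
    out: dict[str, Any] = {}
    for path in paths:
        keys = path.split(".")
        value: Any = flow
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break                      # path absent in this flow: skip it
            value = value[key]
        else:
            # Path fully resolved: rebuild the same nesting in the output.
            dst = out
            for key in keys[:-1]:
                dst = dst.setdefault(key, {})
            dst[keys[-1]] = value
    return out


# Hypothetical flow with a field we do not want to retain.
flow = {
    "time": "2024-06-08T19:42:11Z",
    "source": {"namespace": "shop", "pod_name": "web-7c-abc", "labels": ["k8s:app=web"]},
    "l7": {"http": {"method": "GET", "headers": [{"key": "Authorization", "value": "secret"}]}},
    "verdict": "FORWARDED",
}
mask = ["time", "source.namespace", "source.pod_name", "destination.pod_name", "verdict"]

masked = apply_field_mask(flow, mask)
print(masked)
```

Because l7 is not in this mask, the Authorization header never reaches the log shipper, which is the point of exporting only what investigations need.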
8. Performance overhead and scaling Relay
Hubble’s cost has two distinct buckets, and you size them separately.
L3/L4 flow generation is nearly free. Emitting a flow event from the eBPF datapath is a ring-buffer write the agent drains asynchronously. The marginal CPU per flow is negligible; the bounded cost is the agent’s in-memory ring buffer (tune with --hubble-event-buffer-capacity; default 4095 events/node; the value must be one less than a power of two). More buffer = more RAM per agent, nothing more.
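How much history the buffer holds is simple arithmetic: capacity divided by the node's flow rate. The Python sketch below works that out and picks a capacity for a target window. It assumes the capacity must have the form 2^n - 1 with an upper bound of 65535; the 150 flows/s rate is a made-up figure, so measure your own before sizing anything.

```python
MAX_CAPACITY = 65535   # assumption: largest value the agent flag accepts


def retention_seconds(capacity: int, flows_per_second: float) -> float:
    """Seconds of history a full buffer holds at a steady flow rate."""
    return capacity / flows_per_second


def capacity_for_window(flows_per_second: float, window_seconds: float) -> int:
    """Smallest 2**n - 1 capacity covering the window, capped at MAX_CAPACITY."""
    needed = flows_per_second * window_seconds
    capacity = 1
    while capacity < needed and capacity < MAX_CAPACITY:
        capacity = capacity * 2 + 1        # 1, 3, 7, 15, ... stays of the form 2**n - 1
    return capacity


# Hypothetical busy node emitting 150 flows per second.
print(f"{retention_seconds(4095, 150):.1f}")   # 27.3
print(capacity_for_window(150, 120))           # 32767
print(capacity_for_window(150, 600))           # 65535 (capped: the window does not fit)
```

The last line is the useful one: past a certain flow rate, no buffer size buys you the window you want, and retention has to come from export instead.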
L7 parsing is where the real cost lives, because matched traffic detours through Cilium’s L7 proxy. Budget low-single-digit-percent extra CPU and a sub-millisecond latency hop on the proxied paths only. This is precisely why L7 is opt-in per policy: you pay only for what you instrument.
Relay is the component that strains in big clusters. It maintains a persistent gRPC stream to every node’s Hubble server, so its fan-out grows with node count. On large clusters:
hubble:
  relay:
    replicas: 3 # HA + spreads the fan-out
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 1Gi # cap it; a wide `observe` can spike memory
    prometheus:
      enabled: true # scrape hubble_relay_* to watch its own health
Operational guardrails that prevent self-inflicted incidents:
- Treat a broad, unfiltered hubble observe --follow across a 500-node cluster as a load-generating query against Relay. Always scope with --namespace / --pod filters.
- Keep the per-node event buffer modest; you offload retention to the export sink (Step 7), not to a giant in-memory ring.
- Watch hubble_relay_* metrics. If Relay can’t keep up with a node, you’ll see it lose that node’s stream: flows go missing silently, which looks like a “Hubble is broken” ticket but is really a capacity signal.
Enterprise scenario
A payments platform team running ~280 nodes per cluster across three regions got a hard mandate from security after an audit: prove that PCI-scoped workloads (namespace: cardholder) never egress to anything outside an approved allowlist, and retain the evidence for 12 months. They had no service mesh and were not going to deploy Envoy sidecars across 4,000 pods to get it.
The constraint that bit them: Hubble’s in-memory ring held roughly 30 seconds of flows on their busiest nodes, so “the UI shows no rogue egress” was not auditable evidence — by the time anyone looked, the window was gone.
They solved it with eBPF-native enforcement plus FlowLog export, no mesh:
- FQDN allowlist policy in the cardholder namespace. This both restricts egress and turns on DNS parsing for the proof, in one object:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cardholder-egress-allowlist
  namespace: cardholder
spec:
  endpointSelector: {}
  egress:
    - toEndpoints: # in-cluster DNS so FQDN rules resolve
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports: [{ port: "53", protocol: ANY }]
          rules: { dns: [ { matchPattern: "*" } ] }
    - toFQDNs: # the ONLY approved external destination
        - matchName: "api.acquirer.example.com"
      toPorts:
        - ports: [{ port: "443", protocol: TCP }]
- FlowLog export (Step 7) of the cardholder namespace to a host path, tailed by Vector into an S3 bucket with Object Lock (WORM) for the 12-month immutable retention the auditor required.
- A scheduled job diffed the exported flows against the allowlist and alerted on any world-identity egress that wasn’t api.acquirer.example.com, as defense in depth behind the policy that already denied it.
The result: enforcement and a tamper-evident audit trail, zero sidecars, and a ~2% CPU bump confined to the cardholder nodes that were actually parsed. The decisive move was treating the ephemeral ring buffer as a live tap and pushing the evidence to immutable storage: Cilium policy for enforcement, Hubble for visibility, S3 Object Lock for retention.
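The scheduled audit job from the scenario could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the team's actual job: it assumes the exported records include destination.labels (with reserved:world marking anything outside the cluster) and destination_names, neither of which is in the Step 7 fieldMask as written, so the mask would need to be extended. The function name and sample flows are hypothetical.

```python
import json
from typing import Iterable, Iterator

ALLOWED_FQDNS = frozenset({"api.acquirer.example.com"})


def rogue_world_egress(lines: Iterable[str],
                       allowed: frozenset = ALLOWED_FQDNS) -> Iterator[dict]:
    """Yield flows that left the cluster for anything not on the allowlist."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        flow = record.get("flow", record)   # assumption: wrapped or bare flow
        dst = flow.get("destination", {})
        if "reserved:world" not in dst.get("labels", []):
            continue                        # in-cluster destination: out of scope
        names = set(flow.get("destination_names", []))
        if names and names <= allowed:
            continue                        # every resolved name is approved
        yield flow                          # unknown name, or raw IP with no name


# Hypothetical exported records.
sample = [
    json.dumps({"flow": {"destination": {"labels": ["reserved:world"]},
                         "destination_names": ["api.acquirer.example.com"]}}),
    json.dumps({"flow": {"destination": {"labels": ["reserved:world"]},
                         "destination_names": ["evil.example.net"]}}),
    json.dumps({"flow": {"destination": {"labels": ["reserved:world"]}}}),
    json.dumps({"flow": {"destination": {"labels": ["k8s:app=ledger"]}}}),
]

findings = list(rogue_world_egress(sample))
print(len(findings))   # 2
```

Flows to a raw IP with no resolved name are flagged on purpose: for an audit, an egress you cannot name is not an egress you can approve.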
Verify
Walk this end to end before declaring victory:
# 1) Every node's Hubble server is reachable through Relay
hubble status # expect: all nodes connected, no "unavailable"
# 2) L3/L4 flows are visible with no policy at all
hubble observe --namespace shop --last 20
# 3) L7 is actually being parsed after applying an http/dns CNP
hubble observe --namespace shop --protocol http --last 20 # method/path/status present
hubble observe --namespace shop --type l7 --protocol dns --last 20
# 4) A deliberate policy drop shows the right verdict + reason
hubble observe --namespace shop --verdict DROPPED -o json \
  | jq '(.flow // .).drop_reason_desc' | sort | uniq -c
# 5) Metrics are exposed and scraped
curl -s localhost:9965/metrics | grep -E 'hubble_(drop|http_requests|dns_queries)_total' | head
# 6) FlowLog export is landing on the node
kubectl -n kube-system exec ds/cilium -- \
tail -n 3 /var/run/cilium/hubble/events.log
Expectation: steps 2-4 return flows with correct verdicts and (for L7) decoded protocol fields; step 5 returns non-empty Hubble metric series; step 6 shows fresh JSON lines.