Network Observability with Cilium Hubble: Flow Logs, L7 Visibility, and Service Maps

Most Kubernetes “network observability” is a sidecar tax: an Envoy per pod, double the memory, and a service mesh you now have to operate just to see who talks to whom. Hubble takes the other road. Because Cilium already runs the datapath as eBPF programs attached to every pod’s virtual interface, the data plane already sees every packet — Hubble just asks the kernel to emit a structured event for each flow it processes. No sidecars, no app changes, and L7 visibility for HTTP/gRPC/DNS when you opt in per-workload. This is the working setup: the architecture, enabling flow and L7 visibility, reading flows to debug policy drops, exporting metrics to Prometheus, building the service map, shipping flows to long-term storage, and what it actually costs at scale.

1. The architecture: datapath, Relay, and the UI

Hubble has three layers, and confusing them is the source of most “why can’t I see flows” tickets.

  per node                                cluster-wide
 +-------------------------------+
 | cilium-agent (DaemonSet)      |
 |  eBPF datapath -> ring buffer |
 |  Hubble embedded server       |---gRPC :4244--+
 |   (node-local flow API)       |               |
 +-------------------------------+               v
 +-------------------------------+        [ Hubble Relay ]  --:4245-->  hubble CLI
 | cilium-agent (other nodes)    |---:4244->  (aggregates all       --:80 ----->  Hubble UI
 |  Hubble embedded server       |             node servers into
 +-------------------------------+             one cluster view)

The mental model that keeps you sane:

The agent’s in-memory ring buffer is the only place flows are held, and it is small, lossy, and not persistent by design. Hubble is a live tap, not a database. Anything you want to keep (metrics, long-term flow logs) has to be exported to Prometheus or an external sink. Treat hubble observe as tcpdump, not as Splunk.
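
To make the "live tap" behaviour concrete, here is a minimal Python sketch of a fixed-capacity ring that overwrites its oldest entries. The FlowRing class and its methods are hypothetical names for illustration only; this is not Hubble's implementation, just the same eviction behaviour in miniature.

```python
from collections import deque


class FlowRing:
    """Toy stand-in for a per-node flow ring: fixed capacity, oldest evicted first."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.evicted = 0  # events overwritten before anyone read them

    def push(self, event):
        # A full deque with maxlen drops its oldest item on append.
        if len(self._events) == self._events.maxlen:
            self.evicted += 1
        self._events.append(event)

    def last(self, n):
        # "Give me the last N flows" can only return what is still buffered.
        return list(self._events)[-n:] if n > 0 else []


ring = FlowRing(capacity=4)
for i in range(10):
    ring.push(f"flow-{i}")

print(ring.last(10))  # ['flow-6', 'flow-7', 'flow-8', 'flow-9']
print(ring.evicted)   # 6
```

Asking for ten flows returns four, because the other six were overwritten. The same thing happens on a busy node: anything that must outlive the buffer has to be exported before it is evicted.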

2. Enable Hubble, Relay, and the UI

This assumes Cilium is already your CNI. Enable the three pieces with the Cilium CLI:

# Turn on the embedded server, Relay, and UI
cilium hubble enable --ui

# Wait for everything to settle, then sanity-check
cilium status --wait

Equivalent Helm values if you manage Cilium with Helm/GitOps (the correct path for production):

# values.yaml for the cilium/cilium chart
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
  # mTLS between Relay and node servers; auto-issues certs
  tls:
    auto:
      enabled: true
      method: helm

helm upgrade cilium cilium/cilium --namespace kube-system \
  --reuse-values -f values.yaml
kubectl -n kube-system rollout status ds/cilium

Install the CLI and point it at Relay via a port-forward (the CLI runs the forward for you):

# One-time CLI install (Linux amd64 shown)
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --fail -o hubble.tar.gz \
  "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
sudo tar xzf hubble.tar.gz -C /usr/local/bin hubble

# Forward Relay (defaults to localhost:4245) and confirm node count
cilium hubble port-forward &
hubble status

hubble status should report every node as connected. If a node shows up unavailable, its agent’s embedded server is unhealthy — check that node’s cilium-agent, not Relay.

3. Turn on flow visibility and L7 protocol parsing

Out of the box you get L3/L4 flows for free — every pod-to-pod connection, with verdict (forwarded/dropped) and the Cilium identity on each side. No configuration needed; it is a property of the datapath.

L7 visibility (decoded HTTP methods, paths, status codes, gRPC services, DNS queries) is not free, because it requires Cilium to redirect matching traffic through its userspace L7 proxy (Envoy for HTTP/gRPC, Cilium’s DNS proxy for DNS). You enable it per workload by attaching an L7 rule in a CiliumNetworkPolicy. Specifying an L7 rule is what switches on parsing for that traffic: you are not just filtering, you are asking Cilium to look inside.

DNS visibility for a namespace (this also lets you write FQDN-based policy later):

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: dns-visibility
  namespace: shop
spec:
  endpointSelector: {}            # all pods in the namespace
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"   # parse every DNS query, allow all

HTTP visibility on a service’s ingress — here we still allow everything, we just want it parsed:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: api-http-visibility
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - {}                  # empty rule = match (and parse) all HTTP

Now watch the decoded L7 stream:

hubble observe --namespace shop --protocol http --follow

Sample output — note the method, path, status, and latency are all present without touching the app:

Jun  8 19:42:11.004: shop/web-7c-abc:51324 -> shop/api-7d-xyz:8080 http-request FORWARDED (HTTP/1.1 GET /v1/cart)
Jun  8 19:42:11.009: shop/api-7d-xyz:8080 -> shop/web-7c-abc:51324 http-response FORWARDED (HTTP/1.1 200 GET /v1/cart) took 4.91ms
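
The same decoded fields are available in machine-readable form, which is usually a safer thing to script against than the compact text. The Python sketch below aggregates request count, 5xx count, and mean latency per method and URL. It assumes JSON flow records that carry l7.type, l7.latency_ns, and l7.http.{method,url,code}, optionally wrapped under a top-level "flow" key. The sample records and the summarize_http name are illustrative, not captured output.

```python
import json
from collections import defaultdict


def summarize_http(lines):
    """Aggregate request count, 5xx count and mean latency per (method, url)."""
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "latency_ms": 0.0})
    for line in lines:
        record = json.loads(line)
        flow = record.get("flow", record)  # accept wrapped or bare flows
        l7 = flow.get("l7") or {}
        http = l7.get("http")
        if l7.get("type") != "RESPONSE" or not http:
            continue  # only responses carry a status code and a latency
        entry = totals[(http.get("method"), http.get("url"))]
        entry["count"] += 1
        entry["errors"] += int(http.get("code", 0) >= 500)
        # Protobuf JSON encodes 64-bit integers as strings, so coerce.
        entry["latency_ms"] += int(l7.get("latency_ns", 0)) / 1e6
    return {
        key: {
            "count": v["count"],
            "errors": v["errors"],
            "mean_latency_ms": v["latency_ms"] / v["count"],
        }
        for key, v in totals.items()
    }


# Illustrative records only (hypothetical URL and values).
URL = "http://api.shop:8080/v1/cart"
sample = [
    json.dumps({"flow": {"l7": {"type": "REQUEST",
                                "http": {"method": "GET", "url": URL}}}}),
    json.dumps({"flow": {"l7": {"type": "RESPONSE", "latency_ns": "4000000",
                                "http": {"method": "GET", "url": URL, "code": 200}}}}),
    json.dumps({"flow": {"l7": {"type": "RESPONSE", "latency_ns": "6000000",
                                "http": {"method": "GET", "url": URL, "code": 503}}}}),
]

summary = summarize_http(sample)
print(summary[("GET", URL)])
```

For anything you alert on, prefer the Prometheus metrics over ad hoc aggregation like this. The sketch is for one-off investigations against whatever is still in the buffer.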

Cost-control rule: an empty L7 rule (http: [{}]) sends all of that port’s traffic through the proxy, which adds per-request CPU and a small latency hop. Scope L7 visibility to the services you are actively debugging or whose golden signals you genuinely need. Do not blanket-enable HTTP parsing cluster-wide and then wonder why p99 moved.

4. Read flows to debug drops and policy denials

This is where Hubble earns its keep. The classic failure — “service A can’t reach service B and nobody knows why” — is a two-minute investigation instead of an afternoon of tcpdump and guesswork.

Filter for everything Cilium dropped, across the namespace:

# Dropped flows in the namespace, streamed live
hubble observe --namespace shop --verdict DROPPED --follow

A policy denial is unambiguous in the output:

Jun  8 19:51:03.880: shop/web-7c-abc:44120 -> shop/payments-9a-def:9000 \
  policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Jun  8 19:51:03.880: shop/web-7c-abc:44120 -> shop/payments-9a-def:9000 \
  Policy denied DROPPED (TCP Flags: SYN)

policy-verdict:none means no policy allowed this — the egress from web to payments isn’t permitted. The fix is a policy that allows it, not a firewall change.

The flags that matter in real investigations:

Need                              Flag
--------------------------------  -----------------------------------------------------------
Only dropped traffic              --verdict DROPPED
Why it dropped (reason)           --type drop, then read .drop_reason_desc from --output json
Scope to a pod                    --pod shop/web-7c-abc
One direction                     --from-pod ... / --to-pod ...
By DNS name (needs FQDN parsing)  --fqdn api.shop.svc.cluster.local
Identity-level view               --from-identity <id> / --label app=api
Last N from the ring (not live)   --last 200
Machine-readable                  -o json (or -o jsonpb)

When the cause isn’t obvious, go to JSON and read the structured drop reason and the L4 details:

# Recent CLI versions wrap each flow under .flow; (.flow // .) handles both shapes
hubble observe --namespace shop --type drop --last 50 -o json \
  | jq '(.flow // .)
        | {t:.time, src:.source.pod_name, dst:.destination.pod_name,
           dport:.l4.TCP.destination_port, reason:.drop_reason_desc}'

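Common drop_reason_desc values you will actually see (JSON carries the enum name; the compact output prints the human-readable form): POLICY_DENIED ("Policy denied", no allow rule), STALE_OR_UNROUTABLE_IP ("Stale or unroutable IP", endpoint churn or stale conntrack), UNSUPPORTED_L3_PROTOCOL ("Unsupported L3 protocol"), and CT_MAP_INSERTION_FAILED ("CT: Map insertion failed", conntrack table pressure, which is a capacity signal, not a policy one).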
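
When there are more drops than you can eyeball, grouping them by source, destination, and reason usually surfaces the culprit. The following Python sketch does that over JSON flow lines. It assumes each record exposes verdict, source.pod_name, destination.pod_name, and drop_reason_desc, optionally wrapped under "flow". The sample lines and pod names are made up for illustration.

```python
import json
from collections import Counter


def drop_reasons(lines):
    """Count dropped flows per (source pod, destination pod, drop reason)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        flow = record.get("flow", record)  # accept wrapped or bare flows
        if flow.get("verdict") != "DROPPED":
            continue
        src = (flow.get("source") or {}).get("pod_name", "?")
        dst = (flow.get("destination") or {}).get("pod_name", "?")
        counts[(src, dst, flow.get("drop_reason_desc", "UNKNOWN"))] += 1
    return counts


# Illustrative input: two policy drops, one conntrack drop, one forwarded flow.
sample = [
    '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED",'
    ' "source": {"pod_name": "web-7c-abc"}, "destination": {"pod_name": "payments-9a-def"}}}',
    '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED",'
    ' "source": {"pod_name": "web-7c-abc"}, "destination": {"pod_name": "payments-9a-def"}}}',
    '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "CT_MAP_INSERTION_FAILED",'
    ' "source": {"pod_name": "web-7c-abc"}, "destination": {"pod_name": "api-7d-xyz"}}}',
    '{"flow": {"verdict": "FORWARDED",'
    ' "source": {"pod_name": "web-7c-abc"}, "destination": {"pod_name": "api-7d-xyz"}}}',
]

counts = drop_reasons(sample)
for (src, dst, reason), n in counts.most_common():
    print(f"{n:4d}  {src} -> {dst}  {reason}")
```

A single pair dominating with POLICY_DENIED points at a missing allow rule. A spread of CT_MAP_INSERTION_FAILED across many pairs points at node capacity instead.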

5. Export Hubble metrics to Prometheus and Grafana

Flows are ephemeral; metrics are how you keep the signal. Hubble can aggregate flows into Prometheus metrics directly in the agent — this is independent of the flow ring buffer and is what you alert on.

Enable the metrics you want via Helm. Each metric name can take context options (e.g. labelsContext, sourceContext) that control label cardinality — choose carefully, this is your cardinality budget:

hubble:
  enabled: true
  metrics:
    # Each entry is a metric with its context/label options
    enabled:
      - "dns:query;ignoreAAAA"
      - "drop:sourceContext=identity;destinationContext=identity"
      - "tcp"
      - "flow:sourceContext=workload-name;destinationContext=workload-name"
      - "port-distribution"
      - "icmp"
      - "httpV2:exemplars=true;labelsContext=source_namespace,destination_namespace"
    # Expose the metrics endpoint and ship a ServiceMonitor
    serviceMonitor:
      enabled: true
    dashboards:
      enabled: true        # provisions the official Hubble Grafana dashboards
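
To put rough numbers on the cardinality budget, the worst case for one metric is the product of the distinct values each label can take. The Python sketch below compares pod-level and workload-level contexts. The cluster sizes are hypothetical, and real series counts are normally far lower because only observed label combinations exist; treat the result as an upper bound for comparing options.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical cluster: 4,000 pods, 150 workloads, 5 drop reasons seen.
by_pod = worst_case_series({"source": 4000, "destination": 4000, "reason": 5})
by_workload = worst_case_series({"source": 150, "destination": 150, "reason": 5})

print(f"pod-level context:      {by_pod:,}")       # 80,000,000
print(f"workload-level context: {by_workload:,}")  # 112,500
```

The gap of several orders of magnitude is why the context options deserve a deliberate choice: pod names also churn on every rollout, which keeps creating new series.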

Apply and confirm the agent is exposing Hubble metrics (separate from Cilium’s own :9962):

helm upgrade cilium cilium/cilium -n kube-system --reuse-values -f values.yaml
kubectl -n kube-system rollout status ds/cilium

# Hubble metrics default to port 9965 on each agent
kubectl -n kube-system port-forward ds/cilium 9965:9965 &
curl -s localhost:9965/metrics | grep -E 'hubble_(drop|http|dns)_' | head

The metrics you actually build SLOs and alerts on:

Metric                                                 What it tells you
-----------------------------------------------------  ------------------------------------------------------------------
hubble_drop_total                                      dropped packets, labeled by reason/protocol/identity
hubble_flows_processed_total                           total flows by verdict, type, subtype
hubble_http_requests_total                             L7 request count by method/status (requires httpV2)
hubble_http_request_duration_seconds                   HTTP latency histogram (gives you RED golden signals, with exemplars)
hubble_dns_queries_total / hubble_dns_responses_total  DNS query volume and rcode (catch NXDOMAIN storms)
hubble_tcp_flags_total                                 RST/SYN distribution, a connection-churn smell test

A PromQL alert that flags policy-driven drops between workloads (the kind that silently breaks a feature):

# Sustained drops attributable to policy, by destination identity
sum by (destination, reason) (
  rate(hubble_drop_total{reason="POLICY_DENIED"}[5m])
) > 0

Because httpV2 carries exemplars, a latency spike on the hubble_http_request_duration_seconds panel in Grafana can link straight to a trace, provided the requests carry trace-context headers from which a trace ID can be extracted. It is the same exemplar workflow you use for app metrics, now for the network layer.

6. Build the live service map and find rogue dependencies

Open the UI:

cilium hubble ui            # opens http://localhost:12000 via port-forward

Pick a namespace and the UI renders a live, eBPF-derived service map: every workload as a node, every observed flow as an edge, L7 edges labeled with HTTP path or DNS name. This is not a static diagram someone drew in a wiki two years ago and never updated — it is what is actually talking right now, reconstructed from kernel events.

What this catches that architecture docs never do:

To do the same headless (for CI checks or audits), derive the dependency set from flows as JSON:

# Unique src-namespace/app -> dst-namespace/app edges among the last 5000 buffered flows
# (parentheses matter: in jq, "," binds tighter than "|")
hubble observe --namespace shop --last 5000 -o json \
  | jq -r '(.flow // .)
           | [.source.namespace,
              (.source.labels[]? | select(startswith("k8s:app="))),
              "->",
              .destination.namespace,
              (.destination.labels[]? | select(startswith("k8s:app=")))]
           | @tsv' \
  | sort -u

Diff that set against an allowlist in CI and you have drift detection for your actual network topology, not your intended one.
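
A minimal sketch of that CI diff in Python, assuming the observed edges arrive as tab-separated lines of the form namespace, app label, "->", namespace, app label (the shape the pipeline above is meant to emit). The function names and the allowlist contents are hypothetical.

```python
def parse_edges(tsv_text):
    """Turn 'ns<TAB>k8s:app=x<TAB>-><TAB>ns<TAB>k8s:app=y' lines into edge tuples."""
    edges = set()
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 5 or fields[2] != "->":
            continue  # skip flows that lack an app label on either side
        src_ns, src_label, _, dst_ns, dst_label = fields
        src_app = src_label.split("=", 1)[-1]
        dst_app = dst_label.split("=", 1)[-1]
        edges.add((f"{src_ns}/{src_app}", f"{dst_ns}/{dst_app}"))
    return edges


def unexpected_edges(observed, allowed):
    """Edges seen on the wire that the allowlist does not mention."""
    return sorted(observed - allowed)


# Hypothetical allowlist and observed output.
allowed = {("shop/web", "shop/api")}
observed_tsv = (
    "shop\tk8s:app=web\t->\tshop\tk8s:app=api\n"
    "shop\tk8s:app=web\t->\tshop\tk8s:app=payments\n"
    "shop\t->\tkube-system\tk8s:app=x\n"  # malformed: no source app label
)

drift = unexpected_edges(parse_edges(observed_tsv), allowed)
for src, dst in drift:
    print(f"unexpected dependency: {src} -> {dst}")
```

In a pipeline you would exit non-zero when drift is non-empty. Because the input is only what the buffer held at sampling time, an empty diff means "nothing unexpected was seen", not "nothing unexpected exists".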

7. Export flows to an external sink for retention

The ring buffer holds minutes, maybe seconds under load. For incident forensics, compliance, or “what did this pod talk to last Tuesday,” you must export flows off-node. Cilium supports a built-in FlowLog export: the agent writes flows as JSON to a file (typically a host path you tail with your log shipper), with optional field redaction and filtering so you don’t drown in or leak data.

Configure export via Helm (the hubble.export.dynamic values), which renders a flow-log configuration file that the agent reads:

hubble:
  enabled: true
  export:
    dynamic:
      enabled: true
      config:
        createConfigMap: true      # let the chart render the flow-log ConfigMap
        content:
          - name: "all-l7"
            filePath: "/var/run/cilium/hubble/events.log"
            fieldMask:
              - time
              - source.namespace
              - source.pod_name
              - destination.namespace
              - destination.pod_name
              - l4
              - l7
              - verdict
            includeFilters:
              - source_pod: ["shop/"]      # only flows sourced from the shop namespace
            excludeFilters: []

The agent now appends newline-delimited JSON to that path on each node. Ship it the normal way — a Fluent Bit / Vector / Promtail tail on the host path into Loki, OpenSearch, or S3 — and you have queryable, retained flow logs decoupled from Hubble’s tiny buffer. Keep the fieldMask tight: full L7 payloads are large and may contain sensitive headers, so export only the fields your investigations need.
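
To see what a tight mask buys you, the Python sketch below emulates dotted-path field selection on a flow record. It is a toy model of the idea, not the exporter's code; apply_field_mask, the sample flow, and the header value are all hypothetical.

```python
def apply_field_mask(flow, mask):
    """Keep only the dotted paths listed in mask; everything else is dropped."""
    masked = {}
    for path in mask:
        keys = path.split(".")
        value = flow
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # path not present in this flow, nothing to copy
            value = value[key]
        else:
            dest = masked
            for key in keys[:-1]:
                dest = dest.setdefault(key, {})
            dest[keys[-1]] = value
    return masked


# Illustrative flow with a sensitive header in the L7 payload.
flow = {
    "time": "2024-06-08T19:42:11.009Z",
    "verdict": "FORWARDED",
    "source": {"namespace": "shop", "pod_name": "web-7c-abc", "identity": 1234},
    "destination": {"namespace": "shop", "pod_name": "api-7d-xyz"},
    "l7": {"http": {"code": 200, "method": "GET",
                    "headers": [{"key": "Authorization", "value": "Bearer secret"}]}},
}

masked = apply_field_mask(
    flow,
    ["time", "source.pod_name", "destination.namespace", "l7.http.code", "verdict"],
)
print(masked)
```

Selecting l7.http.code rather than the whole l7 subtree leaves the Authorization header out of the record. That is the practical difference between a tight mask and a convenient one.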

8. Performance overhead and scaling Relay

Hubble’s cost has two distinct buckets, and you size them separately.

L3/L4 flow generation is nearly free. Emitting a flow event from the eBPF datapath is a ring-buffer write the agent drains asynchronously. The marginal CPU per flow is negligible; the bounded cost is the agent’s in-memory ring buffer (tune with --hubble-event-buffer-capacity; default 4095 events/node). More buffer = more RAM per agent, nothing more.
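
A quick back-of-the-envelope helps when sizing the buffer: the retention window is roughly capacity divided by the node's flow rate. The Python sketch below shows the arithmetic, plus a check for the constraint that the capacity must be one less than a power of two. The 140 flows/s figure is a hypothetical busy node, not a measurement.

```python
def retention_seconds(capacity, flows_per_second):
    """Approximate seconds of history a ring of `capacity` events can hold."""
    return capacity / flows_per_second


def is_valid_capacity(n):
    """True for 2**k - 1 values such as 4095 or 65535."""
    return n > 0 and (n & (n + 1)) == 0


default_window = retention_seconds(4095, 140)   # default capacity
larger_window = retention_seconds(65535, 140)   # a much larger ring

print(f"default ring: {default_window:.2f} s")       # 29.25 s
print(f"65535 ring:   {larger_window / 60:.1f} min")  # 7.8 min
print(is_valid_capacity(4095), is_valid_capacity(5000))  # True False
```

Even a sixteen-fold larger ring buys minutes, not days, which is why retention belongs in an exported sink and not in a bigger buffer.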

L7 parsing is where the real cost lives, because matched traffic detours through Cilium’s L7 proxy. Budget low-single-digit-percent extra CPU and a sub-millisecond latency hop on the proxied paths only. This is precisely why L7 is opt-in per policy: you pay only for what you instrument.

Relay is the component that strains in big clusters. It maintains a persistent gRPC stream to every node’s Hubble server, so its fan-out grows with node count. On large clusters:

hubble:
  relay:
    replicas: 3                 # HA + spreads the fan-out
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 1Gi             # cap it; a wide `observe` can spike memory
    prometheus:
      enabled: true             # scrape hubble_relay_* to watch its own health

Operational guardrails that prevent self-inflicted incidents:

Enterprise scenario

A payments platform team running ~280 nodes per cluster across three regions got a hard mandate from security after an audit: prove that PCI-scoped workloads (namespace: cardholder) never egress to anything outside an approved allowlist, and retain the evidence for 12 months. They had no service mesh and were not going to deploy Envoy sidecars across 4,000 pods to get it.

The constraint that bit them: Hubble’s in-memory ring held roughly 30 seconds of flows on their busiest nodes, so “the UI shows no rogue egress” was not auditable evidence — by the time anyone looked, the window was gone.

They solved it with eBPF-native enforcement plus FlowLog export, no mesh:

  1. FQDN allowlist policy in the cardholder namespace. One object both restricts egress and turns on the DNS parsing that produces the proof:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cardholder-egress-allowlist
  namespace: cardholder
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:                    # in-cluster DNS so FQDN rules resolve
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports: [{ port: "53", protocol: ANY }]
          rules: { dns: [ { matchPattern: "*" } ] }
    - toFQDNs:                        # the ONLY approved external destination
        - matchName: "api.acquirer.example.com"
      toPorts:
        - ports: [{ port: "443", protocol: TCP }]

  2. FlowLog export (Step 7) of the cardholder namespace to a host path, tailed by Vector into an S3 bucket with Object Lock (WORM) for the 12-month immutable retention the auditor required.
  3. A scheduled job diffed the exported flows against the allowlist and alerted on any world-identity egress that wasn’t api.acquirer.example.com, as defense in depth behind the policy that already denied it.
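
The scheduled diff in the last step could look something like the Python sketch below. It is a guess at the shape of such a job, not the team's actual code. It assumes the exported records include source.namespace, destination.labels, destination_names, and IP.destination (so the export field mask would need to cover them), and that off-cluster peers carry the reserved:world label. The find_violations name and the sample records are hypothetical.

```python
import json

# Assumption: mirrors the single approved destination in the policy above.
ALLOWED_FQDNS = {"api.acquirer.example.com"}


def find_violations(lines, namespace="cardholder"):
    """Return exported egress flows to the outside world that miss the allowlist."""
    violations = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        flow = record.get("flow", record)  # accept wrapped or bare flows
        source = flow.get("source") or {}
        destination = flow.get("destination") or {}
        if source.get("namespace") != namespace:
            continue
        if "reserved:world" not in destination.get("labels", []):
            continue  # in-cluster peer, out of scope for this check
        names = set(flow.get("destination_names", []))
        if names and names <= ALLOWED_FQDNS:
            continue  # approved external call
        violations.append({
            "time": flow.get("time"),
            "pod": source.get("pod_name"),
            "dst_ip": (flow.get("IP") or {}).get("destination"),
            "dst_names": sorted(names),  # empty means a bare-IP destination
        })
    return violations


def _flow(labels, names, ip):
    # Helper that builds one illustrative exported record.
    return json.dumps({"flow": {
        "time": "2024-06-08T19:51:03Z",
        "source": {"namespace": "cardholder", "pod_name": "vault-0"},
        "destination": {"labels": labels},
        "destination_names": names,
        "IP": {"destination": ip},
    }})


sample = [
    _flow(["reserved:world"], ["api.acquirer.example.com"], "203.0.113.10"),
    _flow(["reserved:world"], ["evil.example.net"], "198.51.100.7"),
    _flow(["reserved:world"], [], "192.0.2.55"),
    _flow(["k8s:app=db"], [], "10.0.1.20"),
]

violations = find_violations(sample)
for v in violations:
    print(v)
```

Treating a world destination with no resolved name as a violation is deliberate in this sketch: traffic to a bare IP bypasses the FQDN allowlist by construction, so it should be surfaced, not skipped.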

The result: enforcement and tamper-evident audit trail, zero sidecars, and a ~2% CPU bump confined to the cardholder nodes that were actually parsed. The decisive move was treating the ephemeral ring buffer as a live tap and pushing the evidence to immutable storage — Hubble for visibility and enforcement, S3 Object Lock for retention.

Verify

Walk this end to end before declaring victory:

# 1) Every node's Hubble server is reachable through Relay
hubble status        # expect: all nodes connected, no "unavailable"

# 2) L3/L4 flows are visible with no policy at all
hubble observe --namespace shop --last 20

# 3) L7 is actually being parsed after applying an http/dns CNP
hubble observe --namespace shop --protocol http --last 20   # method/path/status present
hubble observe --namespace shop --type l7 --protocol dns --last 20

# 4) A deliberate policy drop shows the right verdict + reason
hubble observe --namespace shop --verdict DROPPED -o json \
  | jq -r '(.flow // .).drop_reason_desc' | sort | uniq -c

# 5) Metrics are exposed and scraped (needs the 9965 port-forward from Step 5)
curl -s localhost:9965/metrics | grep -E 'hubble_(drop|http|dns)_' | head

# 6) FlowLog export is landing on the node
kubectl -n kube-system exec ds/cilium -- \
  tail -n 3 /var/run/cilium/hubble/events.log

Expectation: steps 2-4 return flows with correct verdicts and (for L7) decoded protocol fields; step 5 returns non-empty Hubble metric series; step 6 shows fresh JSON lines.

Checklist
