GKE Dataplane V2: Cilium-Based Network Policy and Observability

Most GKE clusters that “have network policy” actually have an iptables-based enforcement plane bolted onto kube-proxy, and the moment you ask it a hard question — which pod talked to which external host, and was that connection allowed or dropped? — it goes silent. GKE Dataplane V2 changes the substrate underneath. It replaces kube-proxy’s iptables service routing with eBPF programs in the kernel, ships a managed Cilium as the policy engine, and exposes connection-level allow/deny logging that you can actually query. This guide is about running that plane in production: building a default-deny baseline, controlling egress by FQDN and CIDR, proving every decision with logs, reaching for CiliumClusterwideNetworkPolicy when namespaced policy is not enough, and migrating a live cluster without dropping a single packet you meant to keep.

1. How Dataplane V2 replaces kube-proxy with eBPF and Cilium

In a stock GKE cluster, kube-proxy watches Services and Endpoints and writes iptables (or IPVS) rules so that a packet to a ClusterIP gets DNAT’d to a backend pod. Network policy, if you enable it, is a second system (Calico) writing its own iptables chains. The result is two control loops, two rule sets, and a rule-evaluation cost that grows with the number of services and policies.

Dataplane V2 collapses this. As a packet hits a GKE node, eBPF programs attached in the kernel decide routing, load-balancing, and policy enforcement in one pass. There is no kube-proxy DaemonSet; service load-balancing is done in eBPF. Cilium’s agent (anetd / cilium pods in kube-system) translates Kubernetes objects into the eBPF maps those programs read. The practical wins:

Enforcement is always on. Kubernetes NetworkPolicy is enforced natively; you do not install or manage Calico.
Service routing scales by map lookup, not by walking iptables chains, so policy and service count stop being a latency tax.
You get identity-aware logging. Because Cilium assigns a security identity to every endpoint, allow/deny decisions can be logged with pod and namespace context — the thing iptables never gave you.

Mental model: Dataplane V2 is managed Cilium. You get Kubernetes NetworkPolicy, GKE-specific CRDs (FQDNNetworkPolicy, NetworkLogging), and — where supported — a subset of native Cilium CRDs. You do not get a free-for-all Cilium install; Hubble UI, Cilium-managed Ingress, and arbitrary Cilium versions are Google’s to manage, not yours.

Enable it at create time (it is the default for Autopilot, and required there):

gcloud container clusters create prod-apps \
  --project my-prod-project \
  --region us-central1 \
  --enable-dataplane-v2 \
  --enable-ip-alias \
  --release-channel regular

Confirm the dataplane and the absence of kube-proxy:

gcloud container clusters describe prod-apps \
  --region us-central1 \
  --format="value(networkConfig.datapathProvider)"
# Expect: ADVANCED_DATAPATH

kubectl get ds -n kube-system | grep -E 'kube-proxy|anetd'
# anetd present; kube-proxy absent
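
For a CI gate, the datapath check can be wrapped so that anything other than ADVANCED_DATAPATH fails the pipeline. This is a minimal sketch: the function name require_dataplane_v2 is hypothetical, and it only inspects the string printed by the describe command shown earlier.

```shell
# require_dataplane_v2 PROVIDER
# PROVIDER is the value of networkConfig.datapathProvider for the cluster.
# Returns 0 only for ADVANCED_DATAPATH and prints a one-line verdict either way.
require_dataplane_v2() {
  case "$1" in
    ADVANCED_DATAPATH)
      echo "ok: Dataplane V2 is active"
      ;;
    "")
      echo "fail: empty datapathProvider (describe failed or field unset)"
      return 1
      ;;
    *)
      echo "fail: datapathProvider is $1, not ADVANCED_DATAPATH"
      return 1
      ;;
  esac
}
```

To use it, pass the describe output as the argument: require_dataplane_v2 "$(gcloud container clusters describe prod-apps --region us-central1 --format='value(networkConfig.datapathProvider)')".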

2. Default-deny baselines and namespace-scoped policies

A network policy plane is only as good as its baseline. The default Kubernetes posture is allow-all; until a pod is selected by at least one policy, everything reaches it. The correct production stance is default-deny per namespace, then allow explicitly.

Apply a deny-all (ingress and egress) in every workload namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress

Egress deny-all will break DNS immediately, so the very next policy must re-allow DNS. The rule below opens port 53 to pods in kube-system, where kube-dns runs. Without it, name resolution fails and every outbound connection times out before it starts:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Now allow a specific path: the api pods accept ingress from frontend pods on 8080, and may egress to the db tier on 5432.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 5432

The kubernetes.io/metadata.name label is auto-applied by Kubernetes to every namespace, which makes it a reliable selector for cross-namespace rules — do not hand-label namespaces for this.

3. Egress control with FQDN-based and CIDR-based policies

For destinations outside the cluster, Kubernetes NetworkPolicy egress only understands CIDRs via ipBlock. That is fine for stable infrastructure (a Cloud SQL private IP, an on-prem range) and useless for *.googleapis.com, whose IPs churn. Dataplane V2 covers both cases.

CIDR egress with stock NetworkPolicy: allow the api tier to reach an on-prem CIDR on 443 while carving out one subnet inside it that the pods must never reach:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-egress-onprem
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.50.0.0/16
            except:
              - 10.50.7.0/24
      ports:
        - protocol: TCP
          port: 443

FQDN egress is a GKE-specific CRD, FQDNNetworkPolicy, enabled per-cluster. It works by snooping DNS responses (so it requires kube-dns or Cloud DNS — custom CoreDNS is not supported) and programming the resolved IPs into the eBPF policy maps. Turn it on:

gcloud container clusters update prod-apps \
  --region us-central1 \
  --enable-fqdn-network-policy

Then restrict the api pods to a specific external host plus a wildcard domain on 443. name is an exact FQDN; pattern accepts wildcards:

apiVersion: networking.gke.io/v1alpha1
kind: FQDNNetworkPolicy
metadata:
  name: api-egress-fqdn
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  egress:
    - matches:
        - name: "secure.payments-partner.com"
        - pattern: "*.googleapis.com"
      ports:
        - protocol: TCP
          port: 443

Requirements worth pinning: FQDNNetworkPolicy needs GKE 1.26.4-gke.500 / 1.27.1-gke.400 or later and a supported DNS provider, and it does not cover Windows node pools or Cloud Service Mesh sidecars. Because enforcement is DNS-driven, a pod that connects by raw IP (skipping DNS) is not matched by an FQDN rule. Kubernetes NetworkPolicy has no deny rules, so whether that connection succeeds depends on the other policies selecting the pod: keep the default-deny egress baseline in place and keep ipBlock allows narrow if that bypass matters to you.

4. Network policy logging and verifying allow/deny decisions

This is the capability that justifies Dataplane V2 on its own. Logging is configured by a single cluster-scoped CRD named NetworkLogging. There is exactly one object per cluster and its name must be default — it cannot be renamed.

apiVersion: networking.gke.io/v1alpha1
kind: NetworkLogging
metadata:
  name: default
spec:
  cluster:
    allow:
      log: true
      delegate: false
    deny:
      log: true
      delegate: false

delegate: false logs cluster-wide; set delegate: true to only log connections for namespaces explicitly annotated (policy.network.gke.io/enable-logging: "true"), which is how you keep log volume sane at scale. Apply it:

kubectl apply -f networklogging.yaml
kubectl get networklogging default -o yaml

Decisions land in Cloud Logging under the policy-action log on the k8s_node resource. Query allowed and denied connections for a workload:

gcloud logging read \
  --project my-prod-project \
  'resource.type="k8s_node"
   resource.labels.cluster_name="prod-apps"
   logName="projects/my-prod-project/logs/policy-action"
   jsonPayload.connection.dest_port=5432' \
  --limit 20 --freshness 1h

A denied entry carries jsonPayload.disposition="deny" plus the source and destination pod and namespace. Allowed entries also name the policies that permitted the connection (jsonPayload.policies). To isolate drops in Logs Explorer:

resource.type="k8s_node"
resource.labels.cluster_name="prod-apps"
logName="projects/my-prod-project/logs/policy-action"
jsonPayload.disposition="deny"

This is your loop for safe rollout: enable logging first, watch deny entries for traffic you actually need, add allow rules, and only then tighten. You never have to guess what a policy will break.
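
Since the same filter gets retyped for every cluster and disposition during a rollout, a small builder keeps it consistent. This sketch only assembles the filter text used above; the function name policy_action_filter is hypothetical.

```shell
# policy_action_filter PROJECT CLUSTER [DISPOSITION]
# Prints a Cloud Logging filter for the policy-action log, one clause per line.
# DISPOSITION is optional; pass "allow" or "deny" to narrow the result.
policy_action_filter() {
  printf 'resource.type="k8s_node"\n'
  printf 'resource.labels.cluster_name="%s"\n' "$2"
  printf 'logName="projects/%s/logs/policy-action"\n' "$1"
  if [ -n "${3:-}" ]; then
    printf 'jsonPayload.disposition="%s"\n' "$3"
  fi
}
```

Feed the result straight to the CLI: gcloud logging read "$(policy_action_filter my-prod-project prod-apps deny)" --project my-prod-project --limit 20 --freshness 1h.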

5. Cluster-wide policies and tiering with Cilium CRDs

Namespaced NetworkPolicy cannot express “no pod in this cluster may ever reach the GCE metadata server” — you would have to copy it into every namespace and trust nobody forgets. Dataplane V2 supports CiliumClusterwideNetworkPolicy (CCNP), a cluster-scoped CRD, for exactly these org-wide guardrails. Enable it:

gcloud container clusters update prod-apps \
  --region us-central1 \
  --enable-cilium-clusterwide-network-policy

A canonical guardrail is blocking the link-local metadata endpoint cluster-wide while still allowing Workload Identity’s expected path. The simplest universal rule to start with, though, is a clusterwide ingress contract. This example lets every role=backend endpoint accept ingress on 80 only from role=frontend, regardless of namespace:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: l4-ingress-backend
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            role: frontend
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP

Version floor: CCNP needs gcloud 465.0.0+ and GKE 1.28.6-gke.1095000 / 1.29.1-gke.1016000 or later. Treat CCNP as the platform team’s layer — broad baselines and non-negotiable denies — and leave per-app allows to namespaced NetworkPolicy owned by app teams. GKE’s managed Cilium does not expose every upstream Cilium feature, so validate any CRD field against the GKE docs before depending on it; L7 (HTTP) policy in particular is not part of the managed surface.

6. Interactions with Gateway API, Services, and internal load balancers

Two failure modes bite teams here. First, load balancer health checks. A default-deny ingress policy will silently drop Google’s health-check probes, the LB marks backends unhealthy, and you get a 502 with no obvious cause. You must allow the GKE health-check ranges. With container-native load balancing (NEGs), probes originate from Google’s infrastructure ranges; allow them explicitly:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gclb-health-checks
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 35.191.0.0/16
        - ipBlock:
            cidr: 130.211.0.0/22
      ports:
        - protocol: TCP
          port: 8080

Second, client identity through the data path. With the Gateway API and container-native LBs, traffic from an external Application Load Balancer arrives at the pod from those Google ranges, and traffic from an internal Application Load Balancer arrives from the region’s proxy-only subnet, not from a node IP. For north-south traffic, ingress policies must therefore select on those CIDRs, not on pod selectors. East-west pod-to-pod and pod-to-ClusterIP traffic keeps Cilium identity, so podSelector works there. An internal passthrough LB (Service type: LoadBalancer, internal) preserves the original client IP, which means your ipBlock rules should reflect the actual on-prem or VPC client ranges, not the LB.

7. Performance characteristics and known limitations

The eBPF datapath removes the iptables-chain tax: service and policy lookups are O(1) map operations, so throughput and tail latency stay flat as you scale services and policies — the opposite of kube-proxy + Calico, where the rule count is a linear cost. Direct Server Return and eBPF-based load balancing also cut per-connection overhead.

What to plan around:

NetworkPolicy object scaling. Google publishes a practical ceiling — historically on the order of a few thousand policies and a low-hundreds count of distinct pod-selector label combinations per cluster. Past that, control-plane programming latency grows. Consolidate policies; do not generate one per pod.
No L7 policy in the managed surface. Use NetworkPolicy, FQDNNetworkPolicy, and CCNP for L3/L4. HTTP-method/path enforcement belongs in a service mesh, not here.
FQDN policy is DNS-bound. Raw-IP egress is never matched by an FQDN rule, and after a record changes there is a brief window, bounded by DNS TTLs, in which the programmed IPs lag the new answer.
No SSH / no custom CNI. You cannot swap the dataplane after creation by hand, and on Autopilot you cannot run privileged host-network DaemonSets that some third-party CNIs expect.
Logging volume. Cluster-wide allow logging on a busy cluster is expensive; use delegate: true and annotate only the namespaces under investigation.
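
To see where policy count is concentrated before consolidating, a per-namespace tally is usually enough. This is a rough sketch that assumes the default column layout of kubectl get networkpolicy -A --no-headers (NAMESPACE, NAME, POD-SELECTOR, AGE); the function name count_policies_by_namespace is hypothetical.

```shell
# count_policies_by_namespace
# Reads `kubectl get networkpolicy -A --no-headers` output on stdin and prints
# "COUNT NAMESPACE" lines, largest count first, ties broken by name.
count_policies_by_namespace() {
  awk 'NF { count[$1]++ } END { for (ns in count) print count[ns], ns }' |
    sort -k1,1nr -k2,2
}
```

Run it as kubectl get networkpolicy -A --no-headers | count_policies_by_namespace. A namespace whose count tracks its pod count is the one generating a policy per pod.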

Enterprise scenario

A fintech platform team ran a regional GKE Standard cluster hosting a PCI-scoped payments namespace alongside a dozen non-regulated services. Their auditor’s finding was blunt: they could not prove that payment pods only egressed to the card-network partner and Google APIs, and nothing else. They had Calico NetworkPolicy, but Calico gave them no per-connection allow/deny evidence and no FQDN control — the partner published a hostname, not a stable CIDR, so the team had been allow-listing a /16 that was far wider than the partner actually used.

The constraint: they could not take a maintenance window long enough to recreate the cluster, and they could not risk dropping live settlement traffic. The fix was a staged migration to Dataplane V2 (covered in section 8), but the audit-closing piece was the policy design. They enabled FQDNNetworkPolicy, replaced the /16 allow with an exact-host rule, and turned on NetworkLogging scoped to payments via delegate. Two CRDs closed the finding:

apiVersion: networking.gke.io/v1alpha1
kind: FQDNNetworkPolicy
metadata:
  name: payments-partner-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      pci-scope: "true"
  egress:
    - matches:
        - name: "settle.cardnetwork-partner.com"
        - pattern: "*.googleapis.com"
      ports:
        - protocol: TCP
          port: 443
---
apiVersion: networking.gke.io/v1alpha1
kind: NetworkLogging
metadata:
  name: default
spec:
  cluster:
    allow:
      log: true
      delegate: true        # only annotated namespaces (payments)
    deny:
      log: true
      delegate: false        # log all denies cluster-wide

They annotated the payments namespace, exported the policy-action logs to a BigQuery sink, and handed the auditor a query that returned every payment-pod egress with its destination FQDN and disposition. The /16 shrank to one hostname, and “we believe it is restricted” became “here is the connection log.”

Verify

Run this sequence after applying policies — it confirms the dataplane, the deny baseline, an FQDN allow, and that logging is recording decisions.

# 1. Confirm Dataplane V2 is active and kube-proxy is gone
gcloud container clusters describe prod-apps --region us-central1 \
  --format="value(networkConfig.datapathProvider)"   # ADVANCED_DATAPATH
kubectl get pods -n kube-system -l k8s-app=cilium      # anetd/cilium Running

# 2. Default-deny works: this should TIME OUT (no allow rule yet)
kubectl run probe --rm -it --image=curlimages/curl -n payments -- \
  curl -m 5 http://api.payments.svc.cluster.local:8080

# 3. FQDN allow works: this should SUCCEED for an allowed host
kubectl run probe --rm -it --image=curlimages/curl -n payments \
  --labels="app=api" -- curl -sS -m 5 https://www.googleapis.com -o /dev/null -w "%{http_code}\n"

# 4. The deny was logged
gcloud logging read --project my-prod-project \
  'logName="projects/my-prod-project/logs/policy-action"
   resource.labels.cluster_name="prod-apps"
   jsonPayload.disposition="deny"' \
  --limit 5 --freshness 10m

If step 2 returns a response, your deny baseline is not selecting that pod (check policyTypes). If step 3 hangs, your FQDNNetworkPolicy either is not enabled at the cluster level or the pod is not matched by podSelector. If step 4 is empty, NetworkLogging/default is missing or deny.log is false.

8. Migrating an existing cluster and validating no regressions

Do not treat moving an existing Standard cluster to Dataplane V2 as a flag toggle. Swapping the datapath underneath running workloads is disruptive, so plan it as a change with a parallel cluster, an audit phase, and a rollback path:

Inventory current policy. Export every NetworkPolicy and confirm whether Calico is the enforcer (kubectl get pods -n kube-system | grep calico). Catalog egress that relies on wide CIDRs you intend to replace with FQDN rules.
Stand up a Dataplane V2 clone. Build a parallel cluster with --enable-dataplane-v2 --enable-fqdn-network-policy, apply the same NetworkPolicy set, and run synthetic and canary traffic.
Enable logging in audit mode first. Apply NetworkLogging/default with deny.log: true before tightening. Watch policy-action for legitimate traffic being denied; every such entry is a missing allow rule.
Reconcile, then enforce. Add the allow and FQDN policies the logs revealed, re-test, and confirm zero unexpected denies over a representative window.
Cut over by shifting workloads, not by mutating the old cluster — drain and redeploy onto the new cluster behind the same LB/Gateway, watch error rates, and keep the old cluster as instant rollback until the deny log is clean.

The non-negotiable rule: log before you enforce. Dataplane V2’s allow/deny logging exists precisely so a network-policy migration is evidence-driven, not a leap of faith. If your policy-action deny stream is empty for traffic you expect to keep, you are ready to cut over.
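
During the audit phase, raw deny entries are easier to triage once they are collapsed into distinct flows. The sketch below groups tab-separated rows of source namespace, destination namespace, and destination port. The function name summarize_denies is hypothetical, and the field names jsonPayload.src.pod_namespace and jsonPayload.dest.pod_namespace used in the usage note are assumptions about the log payload that you should confirm against a real entry.

```shell
# summarize_denies
# Reads tab-separated "src_namespace<TAB>dest_namespace<TAB>dest_port" rows on
# stdin and prints each distinct flow with its count, most frequent first.
summarize_denies() {
  awk -F '\t' 'NF == 3 { flows[$1 " -> " $2 ":" $3]++ }
       END { for (f in flows) print flows[f], f }' |
    sort -k1,1nr -k2
}
```

One way to feed it: gcloud logging read with the deny filter from section 4, --freshness 1d, and --format='value(jsonPayload.src.pod_namespace,jsonPayload.dest.pod_namespace,jsonPayload.connection.dest_port)', piped into summarize_denies. Each line of output is either an expected block or a missing allow rule.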

Written by Vinod
