
Kubernetes Networking Internals, In Depth: The Network Model, CNI, IPAM & the Datapath

Most Kubernetes users treat the network as magic. They create a Pod, it gets an IP, it can talk to every other Pod in the cluster, a Service name resolves and load-balances — and they never ask who actually does any of that. The answer is that Kubernetes itself does almost none of it. The control plane assigns no Pod IPs, builds no routes, encapsulates no packets, and programs no firewall rules. Instead, Kubernetes defines a small set of rules the network must obey and a thin plugin interface — the CNI — and then delegates the entire job to a piece of software you choose and install yourself: Calico, Cilium, Flannel, an AWS/Azure/GCP cloud CNI, or one of a dozen others.

That delegation is why “Kubernetes networking” feels both simple and bafflingly inconsistent. The model is identical everywhere — every Pod is a first-class host on a flat network — but the implementation underneath can be a VXLAN overlay on one cluster, BGP-advertised routes on another, and eBPF programs attached to kernel hooks on a third, with wildly different performance, observability and failure modes. To operate clusters in production, pass the CKA/CKS, or hold your own in a senior interview, you have to be able to drop below the abstraction and explain the full path of a packet: how the kubelet wires a Pod’s network namespace, who allocates the IP, how a packet crosses nodes, how a NetworkPolicy actually blocks traffic in the kernel, and what changes when you replace kube-proxy with eBPF.

This lesson takes the whole stack apart. We cover the Kubernetes network model and its non-negotiable rules; the CNI specification in detail (the ADD/DEL contract and the exact kubelet→CRI→CNI call chain that wires a Pod); the three plugin families (overlay, L3/BGP, eBPF) with a hard comparison; IPAM; how NetworkPolicy is enforced at the datapath; kube-proxy (iptables and IPVS) versus the eBPF kube-proxy replacement; and intra-cluster DNS. It is deliberately long: by the end you should be able to debug a broken CNI from first principles and reason about every packet in your cluster.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need a working local cluster and kubectl comfort, plus familiarity with Linux networking primitives (network namespaces, veth pairs, routing tables, iptables) at a basic level — we define them as we go, but prior exposure helps. You should already understand Services end to end; this lesson is the layer underneath them. If you have not done it, read Kubernetes Services & Networking, In Depth: ClusterIP, NodePort, LoadBalancer, Headless & DNS first — it covers the Service object, EndpointSlices, kube-proxy and CoreDNS from the top, where this lesson covers the datapath and CNI internals from the bottom. It also helps to have met Pods, probes and lifecycle. This is Lesson 4 of the Kubernetes Zero-to-Hero “deepening” (Intermediate/Advanced) track, in the Networking module, and it is the foundation that the dedicated Cilium, eBPF & Hubble lesson builds on.

Core concepts: the Kubernetes network model

Everything starts from a model — a contract that any conforming cluster network must satisfy. Kubernetes does not prescribe how to implement it; it prescribes the behaviour. There are four rules:

  1. Every Pod gets its own unique, routable IP address from a cluster-wide Pod CIDR, distinct from the Service CIDR and from the node IPs. A Pod is addressable like a tiny VM.
  2. Every Pod can communicate with every other Pod on any node directly, without NAT. The source Pod sees the destination’s real Pod IP, and the destination sees the source’s real Pod IP. This is the flat network.
  3. Every node can communicate with every Pod (and agents on a node — the kubelet, kube-proxy — can reach Pods on that node), without NAT.
  4. The IP a Pod sees for itself is the same IP others use to reach it — no address translation sits in the middle of Pod-to-Pod traffic.

This “IP-per-Pod, flat, NAT-free” model is the single most important idea in Kubernetes networking. It is deliberately the opposite of plain Docker’s default bridge, where containers sit on a private, host-local subnet and are reachable from outside only through the host’s IP, by publishing ports with NAT (-p 8080:80). In Kubernetes there are no port-mapping games: a container listening on :8080 is reachable at podIP:8080 from anywhere in the cluster. Two Pods can both listen on :8080 because they have different IPs. This is why a Pod is a network identity, not just a process group.

Why “no NAT” matters so much

NAT (Network Address Translation) rewrites source or destination addresses as packets cross a boundary. It is the source of endless distributed-systems pain: it hides the real client IP (breaking audit logs, IP allow-lists and rate limiting), it breaks protocols that embed IPs in their payload, and it makes connection tracking a stateful bottleneck. By forbidding NAT for Pod-to-Pod traffic, Kubernetes makes the cluster behave like one big flat L3 network where every workload has a real, stable-for-its-lifetime identity. (NAT still appears at the edges — for egress to the internet via SNAT/masquerade, and sometimes for Service traffic via kube-proxy — but never between Pods.)

The three IP ranges you must keep straight

| Range | What it addresses | Who allocates from it | Typical example |
|---|---|---|---|
| Pod CIDR | Individual Pod IPs | The CNI plugin (via IPAM) | 10.244.0.0/16, sliced into a /24 per node |
| Service CIDR | Virtual ClusterIPs | The API server (--service-cluster-ip-range) | 10.96.0.0/12 |
| Node IPs | The physical/VM hosts | Your infra (VPC/DHCP) | 192.168.0.0/24 |

The Pod CIDR and Service CIDR are virtual/internal and must not overlap with each other or with the node network. A Service’s ClusterIP is not a real interface anywhere — it is a target that kube-proxy (or eBPF) DNATs onto a real Pod IP from the Pod CIDR. Keep these three straight and 80% of “I can’t reach my Service” confusion evaporates.
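The non-overlap rule is easy to check mechanically before you build a cluster. The sketch below is a minimal POSIX shell helper, not anything shipped with Kubernetes: the function names ip_to_int, cidr_range and cidrs_overlap are made up for illustration, and it assumes IPv4 only. It turns each CIDR into a numeric range and compares the three example ranges from the table.

```shell
# Convert dotted-quad IPv4 to an integer (subshell body, so IFS does not leak).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Print "<first> <last>" address of a CIDR as integers.
cidr_range() {
  base=$(ip_to_int "${1%/*}")
  size=$(( 1 << (32 - ${1#*/}) ))
  first=$(( base / size * size ))      # align to the block boundary
  echo "$first $(( first + size - 1 ))"
}

# Exit status 0 if the two CIDRs share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

pod=10.244.0.0/16 svc=10.96.0.0/12 nodes=192.168.0.0/24
for pair in "$pod $svc" "$pod $nodes" "$svc $nodes"; do
  set -- $pair
  if cidrs_overlap "$1" "$2"; then echo "OVERLAP: $1 $2"; else echo "ok: $1 $2"; fi
done
```

With the table's values all three pairs print ok. Swapping in a Pod CIDR such as 10.100.0.0/16 would be flagged, because 10.96.0.0/12 extends up to 10.111.255.255.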

The four communication paths

Every request in a cluster takes one of four paths, and a different mechanism handles each:

| Path | Mechanism that handles it |
|---|---|
| Container ↔ container in the same Pod | localhost: they share one network namespace and one IP |
| Pod ↔ Pod (same or different node) | The flat CNI network: direct, by Pod IP, no NAT |
| Pod ↔ Service | kube-proxy / eBPF DNATs the ClusterIP to a backing Pod IP; the name is resolved by CoreDNS |
| External ↔ Service | NodePort / LoadBalancer / Ingress → node → kube-proxy/eBPF → Pod |

The CNI owns rows 1–2 (the flat network itself). kube-proxy or eBPF owns rows 3–4 (turning virtual Service IPs into real Pod IPs). Holding these two responsibilities apart is the key to debugging: “Pod can’t reach Pod” is a CNI problem; “Pod can’t reach a Service VIP but can reach the Pod directly” is a kube-proxy/eBPF problem.

What the CNI is — and is not

CNI stands for Container Network Interface. It is a specification (a CNCF project) plus a set of reference plugins, and it is intentionally tiny. CNI defines exactly one thing: how a container runtime asks a network plugin to attach (or detach) a container to (from) a network. That is its entire scope.

What CNI is not:

Two files/locations matter on every node:

| Location | Contents | Default path |
|---|---|---|
| CNI config dir | The network configuration: a .conflist (or legacy .conf) JSON file describing which plugin(s) to run and with what settings | /etc/cni/net.d/ |
| CNI bin dir | The plugin executables themselves (e.g. cilium-cni, calico, bridge, host-local, portmap, loopback) | /opt/cni/bin/ |

When you “install a CNI”, you are dropping a config file into /etc/cni/net.d/ and binaries into /opt/cni/bin/ (the CNI’s DaemonSet does this for you via host mounts). If /etc/cni/net.d/ is empty, the kubelet reports the node as NotReady with NetworkPluginNotReady/cni plugin not initialized — a classic “fresh cluster, nodes never go Ready” symptom.

The CNI operations (the contract)

A CNI plugin must implement a handful of operations, selected via the CNI_COMMAND environment variable when the runtime executes the binary:

| Operation | When the runtime calls it | What the plugin does |
|---|---|---|
| ADD | A new sandbox (Pod) is being set up | Create the Pod’s interface, allocate and assign an IP (via IPAM), set up routes; return the resulting Result (IPs, routes, DNS, interfaces) as JSON |
| DEL | The sandbox is being torn down | Release the IP back to IPAM and clean up the interface/routes |
| CHECK | The runtime wants to verify a previously-added config is still in effect | Validate the interface/IP/routes still match; error if drifted |
| GC | Garbage-collect leaked resources (newer spec) | Reclaim IPs/interfaces for sandboxes that no longer exist |
| VERSION | Capability negotiation | Report supported CNI spec versions |

Alongside CNI_COMMAND, the runtime passes the request as environment variables plus JSON on stdin:

The plugin returns a Result on stdout: the assigned IPs (and which interface they belong to), routes, DNS info, and the interfaces created (including MACs). The runtime stores this and reports it up to the kubelet.
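To make the contract tangible, here is a toy stand-in for a plugin written as a shell function. It is a sketch under loud assumptions: fake_cni is a hypothetical name, the IP and gateway in the Result are hard-coded, and nothing is actually created in any namespace. What it does show is the shape of the exchange: the operation in CNI_COMMAND, parameters in CNI_* environment variables, config JSON on stdin, and a JSON Result (or error) on stdout.

```shell
# Toy stand-in for a CNI plugin. Hypothetical: real plugins are executables in
# /opt/cni/bin/ that actually create interfaces; this one only echoes a Result.
fake_cni() {
  cat >/dev/null                       # the network config JSON arrives on stdin
  case "$CNI_COMMAND" in
    VERSION)
      echo '{"cniVersion":"1.0.0","supportedVersions":["0.4.0","1.0.0"]}' ;;
    ADD)
      # A real plugin would create $CNI_IFNAME inside $CNI_NETNS and call IPAM here.
      printf '{"cniVersion":"1.0.0","interfaces":[{"name":"%s","sandbox":"%s"}],' \
        "$CNI_IFNAME" "$CNI_NETNS"
      printf '"ips":[{"address":"10.244.1.5/24","gateway":"10.244.1.1","interface":0}]}\n' ;;
    DEL)
      : ;;                             # release resources; success prints nothing
    *)
      echo '{"cniVersion":"1.0.0","code":4,"msg":"invalid CNI_COMMAND"}'
      return 1 ;;
  esac
}

# What a runtime-style invocation looks like (all values are made up):
echo '{"cniVersion":"1.0.0","name":"demo","type":"fake"}' |
  CNI_COMMAND=ADD CNI_CONTAINERID=abc123 CNI_NETNS=/var/run/netns/demo \
  CNI_IFNAME=eth0 fake_cni
```

Because the whole interface is environment variables plus stdin/stdout, you can exercise a real plugin binary by hand in the same way when debugging, as long as you supply a real netns path and a matching config.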

Config: conflists and chained plugins

The configuration in /etc/cni/net.d/ is usually a .conflist: an ordered list of plugins that are chained. The runtime executes them in order on ADD (and in reverse on DEL), passing the Result of each as the prevResult input to the next. This is how responsibilities are composed. A very common chain:

{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "plugins": [
    {
      "type": "calico",
      "ipam": { "type": "calico-ipam" },
      "policy": { "type": "k8s" },
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
    },
    { "type": "bandwidth", "capabilities": { "bandwidth": true } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}

In this chain, calico is the main plugin (it creates the veth and sets up the datapath), bandwidth does traffic shaping, and portmap provides hostPort support. The file on disk must be plain JSON, so these annotations cannot live in the file as comments.

Key fields: cniVersion (the spec version), name (the network name), and the plugins array. Each plugin has a type (the binary name in /opt/cni/bin/) and a plugin-specific config. Note the nested ipam block — IPAM is itself a (sub-)plugin the main plugin delegates to. The capabilities map lets the runtime pass dynamic, per-Pod data (like a Pod’s hostPort mappings or a bandwidth annotation) into the chain. The two utility plugins above are extremely common: portmap implements hostPort (the only legitimate use of NAT-style port publishing in K8s, for the rare Pod that needs a host port), and bandwidth implements the kubernetes.io/ingress-bandwidth / egress-bandwidth annotations via Linux tc.

How a Pod actually gets wired: kubelet → CRI → CNI

This is the call path interviewers love and most people get fuzzy on. Here is the full chain, step by step, for a Pod being scheduled onto a node:

  1. The kubelet sees a Pod assigned to its node (via the API server watch) and decides to start it. The kubelet does not call CNI directly. It talks to the container runtime over the CRI (Container Runtime Interface) gRPC API.
  2. CRI: create the sandbox. The kubelet calls RunPodSandbox on the CRI runtime (containerd or CRI-O). The runtime creates the pause container (also called the sandbox or infra container) — a tiny, do-nothing container whose only job is to hold the Pod’s network namespace (and IPC/UTS namespaces) open. Every real container in the Pod will later join this namespace, which is why all containers in a Pod share one IP. The pause container is what keeps the namespace alive even while app containers restart.
  3. Runtime → CNI: ADD. With the sandbox’s network namespace created, the container runtime (not the kubelet) invokes the CNI plugin with CNI_COMMAND=ADD, passing CNI_NETNS = the sandbox’s netns path, CNI_IFNAME=eth0, the Pod’s identity in CNI_ARGS, and the .conflist on stdin. (containerd does this via its built-in CNI library; CRI-O similarly. This is the bridge between “Kubernetes-land” and “CNI-land”.)
  4. The CNI plugin does the real work inside the netns:
    • IPAM is consulted first — the plugin (or its ipam sub-plugin) allocates a free Pod IP from this node’s slice of the Pod CIDR.
    • A veth pair is created — a virtual Ethernet cable with two ends. One end (eth0) is moved into the Pod’s network namespace and configured with the allocated IP, a default route, and DNS. The other end stays in the host namespace, attached to the datapath (a Linux bridge, or directly to routing, or to an eBPF program, depending on the plugin).
    • Routes are programmed so that the flat-network rules hold: the host knows how to reach this Pod IP, and the Pod knows how to reach everything else (typically a default route via the host end of the veth).
  5. CNI returns the Result (the assigned IP, routes, DNS) to the runtime, which returns the sandbox status to the kubelet. The Pod now has its IP — this is the IP you see in kubectl get pod -o wide.
  6. CRI: create the app containers. The kubelet now calls CreateContainer/StartContainer for each container in the Pod. Crucially, each app container is created to join the sandbox’s existing network namespace — so they all share eth0 and the one Pod IP, and talk to each other over localhost.
  7. On Pod deletion, the reverse happens: the kubelet tears down containers, then the runtime calls CNI with CNI_COMMAND=DEL, which releases the IP back to IPAM and removes the veth/routes, and finally the sandbox is removed.

The mental model to lock in: kubelet speaks CRI to the runtime; the runtime speaks CNI to the plugin; the plugin wires the namespace. The kubelet and CNI never talk directly. (One nuance: node network readiness travels the same route. The runtime loads the CNI config dir and reports a NetworkReady condition in its CRI status, and the kubelet turns that into the node’s Ready/NotReady state, which is the NotReady check described earlier. The ADD/DEL invocations themselves always come from the runtime.)
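Step 4 of the chain above can be approximated by hand with iproute2, which is a good way to demystify what a plugin does. The following is a sketch, not how any particular CNI is implemented: wire_pod and run are hypothetical helpers, the addresses are illustrative, it uses a named netns from ip netns rather than a runtime-created sandbox, and it defaults to a dry run that only prints the commands.

```shell
# DRY_RUN=1 (default) only prints the commands. Set DRY_RUN=0 and run as root
# on a scratch Linux machine to execute them for real.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# wire_pod <netns-name> <pod-ip/len> <gateway-ip/len>  (all values illustrative)
wire_pod() {
  ns=$1 pod_ip=$2 gw=$3 host_if="veth-$1"
  run ip netns add "$ns"                                           # sandbox netns
  run ip link add "$host_if" type veth peer name eth0 netns "$ns"  # veth pair, eth0 lands in the Pod
  run ip -n "$ns" addr add "$pod_ip" dev eth0                      # the IP that IPAM handed out
  run ip -n "$ns" link set lo up
  run ip -n "$ns" link set eth0 up
  run ip addr add "$gw" dev "$host_if"                             # host end doubles as the gateway
  run ip link set "$host_if" up
  run ip -n "$ns" route add default via "${gw%/*}"                 # Pod reaches everything via the host
}

wire_pod pod1 10.244.1.5/24 10.244.1.1/24
```

Real plugins differ in the details (a bridge, a /32 with proxy ARP, or an eBPF program on the host end), but the skeleton is the same: a namespace, a veth pair, an address, and routes in both directions.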

IPAM: who hands out the Pod IPs

IPAM (IP Address Management) is the sub-system that allocates and frees Pod IPs. It is invoked as part of CNI ADD (allocate) and DEL (free). There are three broad models, and which one your CNI uses has real operational consequences:

| IPAM model | How it works | Pros | Cons / gotchas |
|---|---|---|---|
| host-local (per-node ranges) | Each node owns a fixed slice of the Pod CIDR (e.g. a /24). The host-local plugin tracks allocations in files under /var/lib/cni/networks/<net>/ on that node. Used by Flannel, Calico’s host-local mode, kubenet | Dead simple, no coordination, very fast | Fixed per-node capacity (a /24 = ~254 Pods/node, full or not); IPs can leak on ungraceful node crashes, leaving stale files that exhaust the range until cleaned; rebalancing is hard |
| Cluster-wide / CRD-backed | A central allocator (often the CNI’s own controller using CRDs like Calico’s IPAMBlock/IPPool, or Cilium’s per-node CIDR with cluster coordination) hands out blocks/IPs dynamically | Efficient use of address space, can borrow across nodes, supports multiple pools, IP pinning | More moving parts; depends on the CNI’s datastore/agent being healthy |
| Cloud / VPC-native | Pod IPs come from the cloud’s own address space; Pods get real VPC IPs attached to ENIs/secondary IPs (AWS VPC CNI, Azure CNI, GKE) | Pods are first-class in the VPC (security groups, no overlay), great cloud integration | Bounded by ENI/IP-per-instance limits (a real planning constraint on AWS); IP exhaustion at the VPC level |

Two failure modes you must recognise:

The node’s Pod CIDR is visible on the Node object (kubectl get node <n> -o jsonpath='{.spec.podCIDR}') when the controller-manager’s --allocate-node-cidrs is on (the common setup with Flannel/Calico host-local). Cloud and cluster-wide CNIs may not populate it.
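The host-local bookkeeping described in the table (one file per allocated IP, recording the owner) is simple enough to imitate. This is a toy sketch, not the real plugin: ipam_add and ipam_del are hypothetical names, the directory is a temp dir standing in for /var/lib/cni/networks/<net>/, it always picks the lowest free address, and it does no locking.

```shell
ipam_dir=$(mktemp -d)      # stand-in for /var/lib/cni/networks/<net>/
prefix=10.244.1            # this node's /24 slice (assumed)

# ipam_add <container-id>: claim the lowest free IP, print it; fail when the range is full.
ipam_add() {
  i=2                      # skip .0 (network) and .1 (gateway); stop before .255 (broadcast)
  while [ "$i" -le 254 ]; do
    f="$ipam_dir/$prefix.$i"
    if [ ! -e "$f" ]; then
      echo "$1" > "$f"     # file name = IP, file content = owner
      echo "$prefix.$i"
      return 0
    fi
    i=$(( i + 1 ))
  done
  echo "no IP addresses available" >&2
  return 1
}

# ipam_del <container-id>: free every IP owned by that container.
ipam_del() {
  for f in "$ipam_dir"/*; do
    if [ -f "$f" ] && [ "$(cat "$f")" = "$1" ]; then rm -f "$f"; fi
  done
}

ipam_add ctr-a; ipam_add ctr-b   # two Pods start
ipam_del ctr-a                   # graceful teardown frees ctr-a's address
ipam_add ctr-c                   # the freed address is handed out again
```

The leak scenario falls straight out of this model: if DEL never runs for a container (say the node crashes), its file stays behind and that address is unavailable until someone removes the stale file.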

The plugin families: overlay, L3/BGP, and eBPF

All CNIs satisfy the same model, but they make the flat network real in fundamentally different ways. There are three families.

Family 1 — Overlay (encapsulation): VXLAN / Geneve / IP-in-IP

An overlay builds a virtual L2/L3 network on top of the existing node network by encapsulating each Pod packet inside another packet addressed node-to-node. The classic example is Flannel in VXLAN mode (Calico also offers VXLAN and IP-in-IP).

How it works: when a Pod on node A sends to a Pod on node B, the source packet (src = Pod-A-IP, dst = Pod-B-IP) is wrapped in an outer UDP/VXLAN header addressed src = node-A-IP, dst = node-B-IP. It travels across the physical network as ordinary node-to-node UDP, is decapsulated on node B, and the inner packet is delivered to the Pod. The underlying network never needs to know about Pod IPs at all — it only ever sees node-to-node traffic.
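The cost of that wrapping shows up as a smaller Pod MTU. A small worked calculation, assuming an IPv4 underlay and no extra options (pod_mtu is a hypothetical helper, and Geneve is left out because its overhead varies with options):

```shell
# pod_mtu <underlay-mtu> <encap>: largest inner packet that still fits after encapsulation.
pod_mtu() {
  case $2 in
    vxlan) overhead=50 ;;   # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
    ipip)  overhead=20 ;;   # one extra IPv4 header
    none)  overhead=0  ;;   # native routing, nothing added
    *) echo "unknown encap: $2" >&2; return 1 ;;
  esac
  echo $(( $1 - overhead ))
}

pod_mtu 1500 vxlan   # prints 1450
pod_mtu 1500 ipip    # prints 1480
pod_mtu 1500 none    # prints 1500
```

If the Pod interface is left at 1500 on a VXLAN overlay, full-size packets no longer fit in one underlay frame, which is where the "small requests work, large ones hang" class of bug comes from.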

Family 2 — L3 routing / BGP (no encapsulation)

The routed approach makes Pod IPs real routable addresses on the underlay — no wrapping. The canonical example is Calico in BGP mode. Each node runs a BGP speaker that advertises “this node owns Pod CIDR 10.244.X.0/24” to its peers (other nodes and/or the physical routers / ToR switches). Now the network itself knows how to route a Pod IP to the right node, and packets travel unencapsulated — src = Pod IP, dst = Pod IP, the whole way.
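Once BGP has converged, each node effectively holds one route per peer node's Pod CIDR. The sketch below models that lookup in shell, with a hypothetical three-node route table and made-up node IPs; next_hop is an illustrative helper, not a real tool, and it relies on the per-node CIDRs not overlapping so the first match is the only match.

```shell
# Convert dotted-quad IPv4 to an integer (subshell body, so IFS does not leak).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Hypothetical routes learned via BGP: "<pod-cidr> <node-ip that advertised it>"
routes='10.244.1.0/24 192.168.0.11
10.244.2.0/24 192.168.0.12
10.244.3.0/24 192.168.0.13'

# next_hop <pod-ip>: print the node that owns the Pod CIDR containing it (nothing if none).
next_hop() {
  dst=$(ip_to_int "$1")
  echo "$routes" | while read -r cidr via; do
    base=$(ip_to_int "${cidr%/*}")
    size=$(( 1 << (32 - ${cidr#*/}) ))
    if [ "$dst" -ge "$base" ] && [ "$dst" -lt $(( base + size )) ]; then
      echo "$via"
      break
    fi
  done
}

next_hop 10.244.2.9   # prints 192.168.0.12
```

On a real node the equivalent check is ip route get <pod-ip>, which shows the next hop and outgoing device the kernel would actually use.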

Family 3 — eBPF (programmable kernel datapath)

eBPF lets you load small, verified programs into the Linux kernel that run at hook points (XDP in the network driver, tc ingress/egress, socket operations, etc.) to make packet decisions in the kernel, without iptables. Cilium is the flagship eBPF CNI; Calico also has an eBPF dataplane. eBPF is a datapath technology, orthogonal to the encap question: Cilium can run as an overlay (VXLAN/Geneve) or in direct-routing/native mode. What defines the family is how it makes decisions: programmable maps and kernel hooks instead of long iptables chains.

The comparison that matters

| Dimension | Overlay (Flannel VXLAN) | L3/BGP (Calico) | eBPF (Cilium) |
|---|---|---|---|
| Encapsulation | Yes (VXLAN/Geneve) | No (native routing) | Optional (native or overlay) |
| Datapath | Linux bridge + kernel routing | Kernel routing + iptables | eBPF programs (tc/XDP) |
| MTU impact | ~50 B overhead, lower Pod MTU | Full MTU | Full MTU (native) |
| Throughput / CPU | Encap cost | Near-native | Near-native, often best |
| Underlay requirement | None (any IP network) | BGP peering or L2 adjacency | None (overlay) or routing (native) |
| kube-proxy | Still needed | Still needed (or eBPF mode) | Replaced by eBPF |
| NetworkPolicy | Needs a policy-capable CNI (Flannel alone = none) | Yes (iptables/ipset) | Yes, incl. L7 & DNS-aware |
| Observability | Basic | Good (felix, flow logs) | Excellent (Hubble) |
| Best for | “Just make it work anywhere” | Performance on a cooperative L3 fabric | Scale, security, observability |

A crucial gotcha that bites beginners: Flannel on its own does not implement NetworkPolicy — it only provides connectivity. If you kubectl apply a NetworkPolicy on a Flannel-only cluster, it is silently a no-op (nothing enforces it). You need a policy-capable CNI (Calico, Cilium) or the Calico-for-policy + Flannel-for-networking combo (“Canal”). “My NetworkPolicy isn’t blocking anything” is, nine times out of ten, “my CNI doesn’t enforce policy.”

NetworkPolicy enforcement: where the firewall actually lives

A NetworkPolicy is a Kubernetes object describing allowed ingress/egress for a set of Pods (by label selector). But Kubernetes does not enforce it — it has no datapath. The CNI plugin enforces it, which is why an enforcing CNI is mandatory for policies to do anything. How it is enforced depends on the plugin:

| Enforcement engine | How a policy becomes a packet decision | Scale behaviour |
|---|---|---|
| iptables + ipset (Calico Felix, kube-router, etc.) | The agent watches Pods + policies and programs iptables rules keyed on ipsets (kernel hash sets of Pod IPs) so rules don’t grow per-IP. A packet to/from a selected Pod traverses these chains; no matching allow → drop | Good with ipsets; without them, rule count explodes |
| eBPF maps (Cilium) | The agent assigns each set of identical-label Pods a numeric security identity, stores policy as eBPF map lookups keyed on (source identity → dest identity → port), and decisions happen in-kernel at the tc/XDP hook. Supports L3/L4, L7 (HTTP/gRPC/Kafka), and DNS-based rules | O(1) map lookups; scales to large clusters; far richer (L7) |

The model is default-allow until a policy selects a Pod: the instant any NetworkPolicy selects a Pod for a direction (ingress or egress), that Pod becomes default-deny for that direction and only the listed traffic is permitted. The CNI agent (Calico’s Felix, Cilium’s agent) is the controller that translates NetworkPolicy (and CRDs like CiliumNetworkPolicy/GlobalNetworkPolicy) into the kernel rules. Two consequences: policy enforcement is only as healthy as that agent DaemonSet (if it crashes, policy can go stale), and identity-based enforcement (Cilium) survives Pod IP churn better than IP-based enforcement, because identities are tied to labels, not addresses. The dedicated Cilium, eBPF & Hubble lesson goes deep on identities, L7 policy and flow observability.
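The "default-allow until selected, then default-deny" rule is worth seeing as a decision function. This is a deliberately tiny model, not how Felix or Cilium evaluate policy: the policies table and the allowed helper are hypothetical, Pods are identified by a single app label, and ports are ignored.

```shell
# Hypothetical policy table, one rule per line:
#   <selected app label> <direction> <allowed peer app label, or "-" for none>
policies='web ingress frontend
db ingress web
cache ingress -'

# allowed <pod-app> <direction> <peer-app>: exit status 0 means the packet is allowed.
allowed() {
  selected=no verdict=deny
  while read -r app dir peer; do
    if [ "$app" = "$1" ] && [ "$dir" = "$2" ]; then
      selected=yes                                   # some policy selects this Pod
      if [ "$peer" = "$3" ]; then verdict=allow; fi  # allow rules are additive
    fi
  done <<EOF
$policies
EOF
  [ "$selected" = no ] || [ "$verdict" = allow ]
}

for peer in frontend client; do
  if allowed web ingress "$peer"; then echo "$peer -> web: allow"; else echo "$peer -> web: drop"; fi
done
```

Note what the model implies: web is restricted for ingress only, so its egress is still wide open, and frontend is not selected by any policy, so anything can reach it. The cache line behaves like the lab's default-deny policy: selected, with no allow rules.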

kube-proxy versus the eBPF kube-proxy replacement

The CNI gives you Pod-to-Pod. But a Service ClusterIP is a virtual IP that exists on no interface — something must rewrite “packet to 10.96.0.10:80” into “packet to a real backing Pod IP”. Classically that something is kube-proxy, a per-node DaemonSet that watches Services + EndpointSlices and programs the kernel. It has three datapath modes:

| Mode | Mechanism | Behaviour at scale | Notes |
|---|---|---|---|
| iptables (default) | A tree of iptables chains; each Service → a chain that DNATs to a backing Pod, selecting endpoints with a statistic/probability match | Rule evaluation is roughly O(n) in Services×endpoints; thousands of Services → large rulesets, slower updates, latency on programming changes | Ubiquitous, well understood, “good enough” for most clusters |
| IPVS | Uses the kernel’s IP Virtual Server (L4 load balancer) with real hashing; supports rr, lc, dh, sh, wrr, etc. scheduling | O(1)-ish lookups, scales to many thousands of Services far better than iptables | Needs IPVS kernel modules; still uses some iptables for edge cases (masquerade, NodePort) |
| nftables (newer, GA in recent K8s) | Reimplements the proxy on nftables (the iptables successor) with set-based matching | Much better update/lookup scaling than legacy iptables | The intended long-term successor to the iptables mode |

In all kube-proxy modes the flow is the same shape: a Pod sends to the ClusterIP, the kernel on that Pod’s own node DNATs the destination to a chosen endpoint Pod IP, and the packet then rides the CNI flat network to the endpoint. Source masquerading (SNAT) is added only in specific cases, for example hairpin traffic where a Pod reaches itself through its own Service, or external traffic entering via a NodePort with externalTrafficPolicy: Cluster. kube-proxy is not in the data path as a process; it only programs the rules, and the kernel does the rewriting. (This is why killing kube-proxy doesn’t instantly break existing connections, because the rules persist, but it stops new Services/endpoints from being reflected.)
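In iptables mode the endpoint choice is made by a sequence of rules using the statistic match (-m statistic --mode random --probability p), evaluated top to bottom. For the overall split to be even, each rule's probability has to be conditional on the earlier rules not having matched. The helper below (svc_probabilities is a hypothetical name) works out that arithmetic for n endpoints; the exact decimal formatting kube-proxy writes is not reproduced here.

```shell
# svc_probabilities <n-endpoints>: per-rule probability for an even 1/n split.
svc_probabilities() {
  LC_ALL=C awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++) {
      if (i < n) printf "endpoint %d: --probability %.5f\n", i, 1 / (n - i + 1)
      else       printf "endpoint %d: unconditional\n", i
    }
  }'
}

svc_probabilities 3
# endpoint 1: --probability 0.33333
# endpoint 2: --probability 0.50000
# endpoint 3: unconditional
```

Reading it back: the first rule takes 1/3 of connections, the second takes half of the remaining 2/3 (another 1/3), and the last rule catches whatever is left. This is also why the chain is walked linearly, which is the scaling cost the table refers to.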

The eBPF replacement

Cilium (and Calico’s eBPF mode) can disable kube-proxy entirely and implement Service load-balancing as eBPF programs with hash-map lookups attached at the socket and tc/XDP layers. Instead of walking iptables chains, a connection to a ClusterIP is resolved by an O(1) map lookup — and with socket-level load balancing, the DNAT can happen at connect() time so the packet is born addressed to the right backend, skipping per-packet NAT overhead entirely. NodePort/LoadBalancer can be handled at XDP (in the NIC driver, before the kernel network stack) for line-rate performance.

| Aspect | kube-proxy (iptables/IPVS) | eBPF replacement (Cilium) |
|---|---|---|
| Service lookup | iptables chain walk (O(n)) / IPVS hash | eBPF hash-map (O(1)) |
| Where DNAT happens | Per-packet in netfilter | At connect() (socket LB) and/or tc/XDP |
| Scale (1000s of Services) | iptables degrades; IPVS okay | Flat, excellent |
| Extra capabilities | L4 only | Integrated with policy, observability (Hubble), DSR, Maglev hashing |
| Requirement | Works on old kernels | Modern kernel for full replacement |

To make this concrete with a single packet: a Pod calls web.default.svc.cluster.local:80. (1) CoreDNS resolves the name to the Service’s ClusterIP. (2) The Pod sends to ClusterIP:80. (3) kube-proxy’s kernel rules or Cilium’s eBPF DNAT that to a ready backing Pod IP chosen from the EndpointSlice. (4) The packet rides the CNI flat network to that Pod’s node and into its netns. Four layers — DNS, Service-VIP translation, the CNI fabric, the Pod namespace — each owned by a different component.

Intra-cluster DNS: CoreDNS on the flat network

DNS is the last piece that makes the flat network usable by humans and apps. CoreDNS runs as a Deployment (fronted by a ClusterIP Service, conventionally 10.96.0.10, named kube-dns for backward compatibility) and is the cluster’s resolver. Every Pod’s /etc/resolv.conf is populated by the kubelet (per the Pod’s dnsPolicy, default ClusterFirst) to point at that DNS Service IP, with a search list and ndots:5:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

CoreDNS watches the API server and answers:

The ndots:5 rule means any name with fewer than 5 dots is first tried with each search suffix appended before being tried as-is. That is why web resolves to web.default.svc.cluster.local from inside the default namespace, but also why a careless external lookup like api.github.com (2 dots) wastes several queries on the search list first. The point for this lesson: CoreDNS rides entirely on the flat Pod network and the Service abstraction; it is “just another Pod behind a ClusterIP”. If your CNI is broken, DNS breaks too, and the symptom (could not resolve host) often looks like a DNS problem when the root cause is the underlying network or kube-proxy. Always test connectivity by raw IP before blaming DNS.
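The search-list expansion is easy to reproduce by hand. This sketch prints the names a resolver would try, in order, for the resolv.conf shown above. It is modelled on glibc-style behaviour as an assumption (other resolvers, such as musl's, differ in the details), and candidates is a hypothetical helper, not a real tool.

```shell
search="default.svc.cluster.local svc.cluster.local cluster.local"
ndots=5

# candidates <name>: print the fully qualified names tried, in order.
candidates() {
  name=$1
  case $name in *.) echo "$name"; return ;; esac           # trailing dot: already absolute
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi      # enough dots: as-is goes first
  for s in $search; do echo "$name.$s."; done
  if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi      # otherwise as-is goes last
}

candidates web
candidates api.github.com
```

For api.github.com the real name is the fourth attempt, after three cluster-suffixed misses. Writing external names with a trailing dot (api.github.com.) skips the search list entirely.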

Figure: Kubernetes networking internals & CNI

The diagram traces the full stack: the kubelet calling the CRI to create a pause/sandbox container, the runtime invoking the CNI plugin to allocate an IP (IPAM) and wire a veth into the Pod’s netns, the three plugin families making the flat network real (overlay encap vs BGP routes vs eBPF), and a Service request being DNAT’d by kube-proxy or eBPF onto that flat network — with CoreDNS resolving the name alongside.

Hands-on lab

This lab runs free on a local cluster. We will inspect the real CNI configuration and binaries, trace a Pod’s IP and its veth/namespace, prove the flat network, observe IPAM, and watch how a NetworkPolicy changes behaviour. We use kind, which uses a CNI under the hood and lets us shell into nodes.

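Note: kind ships with a simple default CNI (kindnet). Step 6 only blocks traffic if the cluster’s CNI enforces NetworkPolicy, which is the “Flannel/kindnet alone doesn’t enforce policy” point from above. The steps below do not themselves install Calico, so to see enforcement with it, create the cluster with disableDefaultCNI: true under networking and install Calico from its official manifest before deploying workloads.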

1. A multi-node cluster

cat <<'EOF' | kind create cluster --name cni-lab --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.244.0.0/16"     # the Pod CIDR
  serviceSubnet: "10.96.0.0/16"  # the Service CIDR
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

kubectl get nodes -o wide
# Note each node's INTERNAL-IP (node network) — distinct from Pod IPs.

2. See the CNI config and binaries on a node

# Shell into a node (kind nodes are containers)
docker exec -it cni-lab-worker bash

# The CNI config that the runtime reads:
ls -l /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist 2>/dev/null || cat /etc/cni/net.d/*.conf
# -> shows cniVersion, name, the plugin chain (type, ipam, etc.)

# The CNI plugin binaries the runtime invokes:
ls -l /opt/cni/bin/
# -> bridge, host-local, portmap, loopback, ... (and the CNI's own binary)
exit

You have just seen the two things that are a CNI install: a config file and some executables.

3. Deploy Pods and inspect IPs, namespaces and veths

kubectl create deployment web --image=nginx:1.27 --replicas=3
kubectl set resources deployment web --requests=cpu=50m,memory=32Mi
kubectl rollout status deployment/web
kubectl get pods -l app=web -o wide
# Each Pod has a unique IP from 10.244.0.0/16, spread across nodes.

# Show the node's Pod CIDR slice (IPAM range for that node):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

Now look at the Linux plumbing on a node that hosts a web Pod:

POD=$(kubectl get pod -l app=web -o jsonpath='{.items[0].metadata.name}')
NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
echo "$POD is on $NODE"

docker exec -it "$NODE" bash
# The host side of the veth pairs (one per Pod on this node):
ip -d link show | grep -A1 veth
# The host routes that send Pod-CIDR traffic to the right place:
ip route | grep 10.244
exit

You are seeing the veth pairs (the host end of each Pod’s virtual cable) and the routes the CNI programmed — the concrete realisation of the flat network.

4. Prove the flat, NAT-free network

# From one Pod, curl another Pod by its raw Pod IP (no Service involved):
PODA=$(kubectl get pod -l app=web -o jsonpath='{.items[0].metadata.name}')
IPB=$(kubectl get pod -l app=web -o jsonpath='{.items[1].status.podIP}')
NODEA=$(kubectl get pod "$PODA" -o jsonpath='{.spec.nodeName}')
NODEB=$(kubectl get pod -l app=web -o jsonpath='{.items[1].spec.nodeName}')
echo "Pod A on $NODEA -> Pod B IP $IPB on $NODEB"

kubectl exec "$PODA" -- curl -s -o /dev/null -w "%{http_code}\n" "http://$IPB"
# -> 200, even across nodes, addressing Pod B by its real IP. That is the flat network.

# Confirm the destination sees the *source Pod's real IP* (no NAT) — check nginx logs:
kubectl logs "$(kubectl get pod -l app=web -o jsonpath='{.items[1].metadata.name}')" | tail -1
# The client IP in the log is Pod A's Pod IP, not a node IP.

5. Watch IPAM allocate and release

# Scale up and watch new Pods get fresh IPs from the CIDR:
kubectl scale deployment web --replicas=6
kubectl get pods -l app=web -o wide -w   # Ctrl-C after they are Running
# Scale down: those IPs are released back to IPAM (CNI DEL).
kubectl scale deployment web --replicas=2

6. NetworkPolicy enforcement (with a policy-capable CNI)

# A client Pod and a default-deny test:
kubectl run client --image=curlimages/curl --restart=Never -- sleep 3600
kubectl expose deployment web --port=80   # ClusterIP Service "web"
kubectl wait --for=condition=Ready pod/client

# Baseline: client CAN reach the Service (no policy yet):
kubectl exec client -- curl -s -o /dev/null -w "%{http_code}\n" http://web

# Apply a default-deny ingress policy on the web Pods:
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-deny-ingress
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes: ["Ingress"]   # selects web Pods for Ingress -> default-deny ingress
EOF

# Now the client is blocked (times out) — IF the CNI enforces policy:
kubectl exec client -- curl -s --max-time 4 -o /dev/null -w "%{http_code}\n" http://web || echo "blocked (expected)"

If you ran this on a CNI that doesn’t enforce policy (plain Flannel/kindnet), the curl would still return 200 — proving the policy is a no-op without an enforcing datapath. On Calico/Cilium it times out.

7. See the Service VIP translation

# The Service has a ClusterIP that lives on no interface:
kubectl get svc web   # note the CLUSTER-IP, e.g. 10.96.x.y
# Inspect the kernel rules kube-proxy programmed for it (iptables mode):
docker exec cni-lab-worker bash -c "iptables-save -t nat | grep -i web | head"
# -> KUBE-SVC-* / KUBE-SEP-* chains that DNAT the ClusterIP to backing Pod IPs.

Validation

You should have observed: (a) a CNI config + binaries on disk; (b) unique Pod IPs from the Pod CIDR, with per-node CIDR slices; (c) veth pairs and host routes implementing the flat network; (d) cross-node Pod-to-Pod by raw IP with the real source IP preserved (no NAT); (e) IPs allocated/freed on scale; (f) a NetworkPolicy that blocks traffic only because the CNI enforces it; (g) the iptables NAT chains that turn a ClusterIP into a Pod IP.

Cleanup

kubectl delete networkpolicy web-deny-ingress
kubectl delete deployment web
kubectl delete svc web
kubectl delete pod client
kind delete cluster --name cni-lab

Cost note

Everything ran locally in Docker via kind — zero cloud cost. The only resource consumed is local CPU/RAM; deleting the kind cluster reclaims it all.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Nodes stuck NotReady, cni plugin not initialized / NetworkPluginNotReady | No CNI installed, or its config absent from /etc/cni/net.d/ | Install a CNI (apply its manifest); confirm a .conflist and binaries appear and the CNI DaemonSet is Running |
| Pods stuck ContainerCreating with failed to allocate IP / no IP addresses available | IPAM exhaustion: node CIDR full or leaked allocations | Enlarge the per-node Pod CIDR / cluster CIDR, reduce maxPods, or clean leaked host-local files under /var/lib/cni/networks/ |
| NetworkPolicy doesn’t block anything | CNI doesn’t enforce policy (plain Flannel/kindnet), or the policy agent (Felix/Cilium) is unhealthy | Use a policy-capable CNI (Calico/Cilium/Canal); check the agent DaemonSet is Running on every node |
| Cross-node Pod traffic fails; same-node works | Overlay/routing broken: wrong MTU, blocked VXLAN UDP (8472/4789), missing BGP peering, or a node firewall/security-group dropping it | Verify CNI pods healthy on both nodes; check MTU; open the overlay port / BGP; check cloud SG/NACL allows node-to-node + Pod CIDR |
| Large responses hang, small ones work | MTU mismatch (encap overhead not accounted for); PMTU black hole | Lower the Pod MTU (e.g. 1450 for VXLAN), ensure consistent MTU end to end |
| Pod has no IP / kubectl get pod -o wide shows <none> long after scheduling | CNI ADD failing (bad config, missing binary, IPAM datastore down) | kubectl describe pod for the CNI error; check the CNI agent logs on that node; verify /opt/cni/bin has the plugin |
| Service unreachable by name but reachable by ClusterIP | CoreDNS problem (Pods crashlooping, ndots/search misconfig, wrong dnsPolicy) | kubectl -n kube-system get pods -l k8s-app=kube-dns; test with FQDN; check the Pod’s /etc/resolv.conf |
| Service unreachable by ClusterIP but Pod reachable by Pod IP | kube-proxy problem (crashed, wrong mode, missing IPVS modules) or no ready endpoints | Check kube-proxy DaemonSet; kubectl get endpointslices; verify endpoints exist and are Ready |
| Client IP shows as a node IP, not the real source | SNAT/masquerade on the path (Service externalTrafficPolicy: Cluster, or egress masquerade) | Use externalTrafficPolicy: Local where client IP matters; understand which hops SNAT |

Best practices

Security notes

Interview & exam questions

1. Walk me through exactly how a Pod gets its IP — who calls whom? The kubelet talks to the container runtime over the CRI (RunPodSandbox), which creates the pause/sandbox container holding the Pod’s network namespace. The runtime then invokes the CNI plugin with CNI_COMMAND=ADD, passing the netns path. The CNI plugin calls IPAM to allocate an IP from the node’s Pod-CIDR slice, creates a veth pair (one end eth0 inside the netns with the IP, the other in the host), programs routes, and returns the Result. The kubelet never calls CNI directly.

2. State the Kubernetes network model’s rules. Every Pod gets a unique routable IP; every Pod can reach every other Pod on any node without NAT; every node can reach every Pod without NAT; and a Pod sees itself by the same IP others use to reach it. IP-per-Pod, flat, NAT-free.

3. What is the CNI, concretely — a daemon? A specification plus executable plugins. The core contract is binaries in /opt/cni/bin/ that the runtime invokes (JSON in/out) per the config in /etc/cni/net.d/, implementing ADD/DEL/CHECK/GC. (Modern CNIs also run an agent DaemonSet for policy/IPAM, but the invocation itself is a short-lived binary call.)

4. Overlay vs BGP — when do you pick each? Overlay (VXLAN/Geneve) encapsulates Pod packets node-to-node — works on any network, at the cost of header overhead and lower MTU. BGP/L3 routing advertises Pod CIDRs so packets travel unencapsulated at full MTU and native speed — but requires the underlay to cooperate (BGP peering or L2 adjacency). Overlay for networks you don’t control; BGP for a cooperative L3 fabric.

5. Why is eBPF a different kind of CNI? eBPF is a datapath technology: verified programs attached to kernel hooks (tc/XDP/socket) make packet decisions via maps, replacing iptables. It is orthogonal to encapsulation (Cilium can run as an overlay or with native routing). It enables kube-proxy replacement (O(1) Service lookups, socket-level LB), L7/identity-based NetworkPolicy, and deep observability (Hubble).

6. Who enforces NetworkPolicy? The CNI plugin, not Kubernetes. Kubernetes has no datapath. Calico/Cilium translate policies into iptables+ipset rules or eBPF maps. With a non-enforcing CNI (plain Flannel), NetworkPolicy is a silent no-op.

7. iptables vs IPVS kube-proxy — what’s the difference and why does it matter? Both turn a ClusterIP into a Pod IP via kernel rules. iptables mode walks chains roughly O(n) in Services×endpoints — fine small, slow at thousands of Services. IPVS uses the kernel’s L4 load balancer with real hashing for O(1)-ish lookups and far better scaling, plus multiple scheduling algorithms. (nftables mode is the modern successor; eBPF replaces kube-proxy entirely.)

8. What is IPAM and how does it fail? IP Address Management allocates/frees Pod IPs during CNI ADD/DEL. host-local gives each node a fixed CIDR slice (simple, but fixed capacity and can leak on crashes); cluster-wide/CRD allocators borrow across nodes; VPC-native uses real cloud IPs (bounded by ENI limits). Failure = ContainerCreating with “no IP addresses available” when a range is exhausted.

9. What is the pause container and why does it exist? A tiny no-op container created per Pod that holds the network (and IPC/UTS) namespace open. App containers join its namespace, so they share one IP and talk over localhost. It keeps the namespace alive across app-container restarts.

10. A Pod can reach another Pod by IP but not by Service name. Where do you look? That isolates the problem to the Service/DNS layer, not the CNI flat network (which clearly works). Check CoreDNS (pods healthy? resolv.conf/ndots/search correct? FQDN works?) and, if even the ClusterIP fails, kube-proxy (mode, health) and whether the Service has ready endpoints.

11. Why does Kubernetes forbid NAT between Pods? To give every workload a real, stable identity on a flat L3 network — preserving the real source IP (for logs, allow-lists, rate limiting), avoiding NAT’s protocol breakage, and removing a stateful connection-tracking bottleneck between Pods. NAT survives only at the edges (egress, some Service paths).

12. How does encapsulation affect MTU, and what’s the classic bug? VXLAN adds ~50 bytes per packet, so the usable Pod MTU must drop (e.g. to ~1450). If it isn’t lowered (or PMTU discovery is blocked), large packets are dropped and you get the signature “small requests work, large responses hang” bug.
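
The arithmetic behind question 12 can be sketched in a couple of lines of shell. The 50-byte figure is the VXLAN-over-IPv4 overhead (outer Ethernet 14 + IP 20 + UDP 8 + VXLAN 8); other encapsulations and IPv6 underlays differ, so treat the second argument as something to look up for your CNI. The function name pod_mtu is illustrative.

```shell
# Pod MTU = underlay MTU minus encapsulation overhead (both in bytes).
pod_mtu() {
  underlay_mtu=$1
  encap_overhead=$2
  echo $(( underlay_mtu - encap_overhead ))
}

pod_mtu 1500 50   # standard Ethernet underlay, VXLAN over IPv4: prints 1450
pod_mtu 9000 50   # jumbo-frame underlay, same overhead: prints 8950
```

If the Pod interface is left at the underlay MTU, any packet larger than the computed value no longer fits once the outer headers are added, which is exactly the "large responses hang" symptom.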

Quick check

  1. From which range is a Pod’s IP allocated, and who allocates it?
  2. Which component invokes the CNI plugin — the kubelet or the container runtime?
  3. Name the three CNI plugin families and their defining trait.
  4. True or false: applying a NetworkPolicy on a plain-Flannel cluster blocks traffic.
  5. What does kube-proxy do to a packet destined for a ClusterIP, and is kube-proxy in the data path?

Answers

  1. From the Pod CIDR, by the CNI plugin via IPAM (during ADD).
  2. The container runtime (containerd/CRI-O), which the kubelet drives over the CRI. The kubelet does not call CNI directly.
  3. Overlay (encapsulation — VXLAN/Geneve, e.g. Flannel), L3/BGP (native routing, e.g. Calico), eBPF (programmable kernel datapath, e.g. Cilium).
  4. False — plain Flannel doesn’t enforce NetworkPolicy, so it’s a silent no-op. You need a policy-capable CNI.
  5. The kernel DNATs the ClusterIP to a chosen ready backing Pod IP (selected from EndpointSlices), then the packet rides the flat CNI network. kube-proxy only programs the rules; it is not in the data path as a process.

Exercise

On a kind cluster, deliberately reproduce two failure modes and explain each from first principles:

  1. IPAM exhaustion. Create a kind cluster with a deliberately tiny per-node Pod CIDR (e.g. set the kube-controller-manager node-cidr-mask-size to 27 via the kubeadm config, or schedule far more Pods than the slice allows onto one node with a nodeSelector). Schedule Pods until new ones hang in ContainerCreating; capture the exact CNI error from kubectl describe pod; then show the fix by reducing the Pod count.
  2. Policy is a no-op without enforcement. On a cluster whose CNI does not enforce policy (plain Flannel, or the default kindnet on kind releases that predate its NetworkPolicy support), apply a default-deny ingress NetworkPolicy to a web Deployment and prove (with curl from a client Pod) that traffic is still allowed. Then install Calico, re-apply, and prove it is now blocked.

Write up: for each, the symptom, the root cause in the stack (which layer/component), and the precise fix. This is exactly the reasoning a CKA/CKS scenario tests.
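
When sizing the IPAM-exhaustion experiment, it helps to compare the size of the per-node slice with the kubelet's default maxPods of 110. The helper below is a minimal sketch (the name cidr_size is illustrative) that returns the total number of addresses in an IPv4 prefix; the usable Pod count is typically a little lower because IPAM plugins reserve some addresses, and the exact number depends on the plugin.

```shell
# Total addresses in an IPv4 prefix of the given length: 2^(32 - prefix).
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

cidr_size 24   # the kubeadm default per-node mask: prints 256
cidr_size 27   # the tiny slice suggested in the exercise: prints 32
```

With a /24 the slice is larger than 110, so the scheduler's Pod limit is hit before IPAM runs dry. With a /27 the scheduler will still place Pods beyond what the slice can address, so the failure surfaces at CNI ADD time rather than at scheduling time.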

Certification mapping

Glossary

Next steps

You can now reason about every packet in a cluster from the wire up: the model it must satisfy, the CNI handshake that wires each Pod, how IPs are allocated, how the three plugin families make the flat network real, and how Services and policy are enforced in the datapath. Next, learn how to see what your cluster is doing — metrics, dashboards and alerting: Kubernetes Monitoring, In Depth: metrics-server, Prometheus, Grafana & Alerting. For the deepest dive on the most powerful datapath — eBPF identities, L7/DNS-aware policy and flow observability — go straight to Cilium, eBPF, NetworkPolicy & Hubble Observability. And to see the top-down view of the same Service/DNS machinery this lesson explained from the bottom, revisit Kubernetes Services & Networking.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
