Most Kubernetes users treat the network as magic. They create a Pod, it gets an IP, it can talk to every other Pod in the cluster, a Service name resolves and load-balances — and they never ask who actually does any of that. The answer is that Kubernetes itself does almost none of it. The control plane assigns no Pod IPs, builds no routes, encapsulates no packets, and programs no firewall rules. Instead, Kubernetes defines a small set of rules the network must obey and a thin plugin interface — the CNI — and then delegates the entire job to a piece of software you choose and install yourself: Calico, Cilium, Flannel, an AWS/Azure/GCP cloud CNI, or one of a dozen others.
That delegation is why “Kubernetes networking” feels both simple and bafflingly inconsistent. The model is identical everywhere — every Pod is a first-class host on a flat network — but the implementation underneath can be a VXLAN overlay on one cluster, BGP-advertised routes on another, and eBPF programs attached to kernel hooks on a third, with wildly different performance, observability and failure modes. To operate clusters in production, pass the CKA/CKS, or hold your own in a senior interview, you have to be able to drop below the abstraction and explain the full path of a packet: how the kubelet wires a Pod’s network namespace, who allocates the IP, how a packet crosses nodes, how a NetworkPolicy actually blocks traffic in the kernel, and what changes when you replace kube-proxy with eBPF.
This lesson takes the whole stack apart. We cover the Kubernetes network model and its non-negotiable rules; the CNI specification in detail (the ADD/DEL contract and the exact kubelet→CRI→CNI call chain that wires a Pod); the three plugin families (overlay, L3/BGP, eBPF) with a hard comparison; IPAM; how NetworkPolicy is enforced at the datapath; kube-proxy (iptables and IPVS) versus the eBPF kube-proxy replacement; and intra-cluster DNS. It is deliberately long: by the end you should be able to debug a broken CNI from first principles and reason about every packet in your cluster.
Learning objectives
By the end of this lesson you can:
- State the Kubernetes network model’s four rules and the four communication paths it must guarantee, and explain why NAT-free Pod-to-Pod reachability is the load-bearing constraint.
- Describe the CNI specification precisely: the ADD/DEL/CHECK/GC operations, how configuration is structured (conflists, chained plugins), and the exact data passed in and out.
- Trace the kubelet → CRI → CNI call path that creates a Pod’s network namespace, the pause/sandbox container, the veth pair, and the IP assignment.
- Compare the overlay (VXLAN/Geneve), L3 routing/BGP, and eBPF plugin families on encapsulation, performance, scalability, observability and operational cost, and pick the right one for a scenario.
- Explain IPAM — host-local versus cluster-wide/CRD-backed allocators, the Pod CIDR per node, and exhaustion failure modes.
- Explain how NetworkPolicy is translated into datapath enforcement (iptables/ipset versus eBPF maps), and why the CNI, not Kubernetes, enforces it.
- Contrast kube-proxy (iptables and IPVS modes) with the eBPF kube-proxy replacement on how a ClusterIP becomes a real packet, and on scale behaviour.
- Describe CoreDNS-based service discovery and how it rides on the flat Pod network.
Prerequisites & where this fits
You need a working local cluster and kubectl comfort, plus familiarity with Linux networking primitives (network namespaces, veth pairs, routing tables, iptables) at a basic level — we define them as we go, but prior exposure helps. You should already understand Services end to end; this lesson is the layer underneath them. If you have not done it, read Kubernetes Services & Networking, In Depth: ClusterIP, NodePort, LoadBalancer, Headless & DNS first — it covers the Service object, EndpointSlices, kube-proxy and CoreDNS from the top, where this lesson covers the datapath and CNI internals from the bottom. It also helps to have met Pods, probes and lifecycle. This is Lesson 4 of the Kubernetes Zero-to-Hero “deepening” (Intermediate/Advanced) track, in the Networking module, and it is the foundation that the dedicated Cilium, eBPF & Hubble lesson builds on.
Core concepts: the Kubernetes network model
Everything starts from a model — a contract that any conforming cluster network must satisfy. Kubernetes does not prescribe how to implement it; it prescribes the behaviour. There are four rules:
- Every Pod gets its own unique, routable IP address from a cluster-wide Pod CIDR, distinct from the Service CIDR and from the node IPs. A Pod is addressable like a tiny VM.
- Every Pod can communicate with every other Pod on any node directly, without NAT. The source Pod sees the destination’s real Pod IP, and the destination sees the source’s real Pod IP. This is the flat network.
- Every node can communicate with every Pod (and agents on a node — the kubelet, kube-proxy — can reach Pods on that node), without NAT.
- The IP a Pod sees for itself is the same IP others use to reach it — no address translation sits in the middle of Pod-to-Pod traffic.
This “IP-per-Pod, flat, NAT-free” model is the single most important idea in Kubernetes networking. It is deliberately the opposite of plain Docker’s default bridge, where containers get private addresses behind the host’s IP and you publish ports with NAT (-p 8080:80). In Kubernetes there are no port-mapping games: a container listening on :8080 is reachable at podIP:8080 from anywhere in the cluster. Two Pods can both listen on :8080 because they have different IPs. This is why a Pod is a network identity, not just a process group.
Why “no NAT” matters so much
NAT (Network Address Translation) rewrites source or destination addresses as packets cross a boundary. It is the source of endless distributed-systems pain: it hides the real client IP (breaking audit logs, IP allow-lists and rate limiting), it breaks protocols that embed IPs in their payload, and it makes connection tracking a stateful bottleneck. By forbidding NAT for Pod-to-Pod traffic, Kubernetes makes the cluster behave like one big flat L3 network where every workload has a real, stable-for-its-lifetime identity. (NAT still appears at the edges — for egress to the internet via SNAT/masquerade, and sometimes for Service traffic via kube-proxy — but never between Pods.)
The three IP ranges you must keep straight
| Range | What it addresses | Who allocates from it | Typical example |
|---|---|---|---|
| Pod CIDR | Individual Pod IPs | The CNI plugin (via IPAM) | 10.244.0.0/16, sliced into a /24 per node |
| Service CIDR | Virtual ClusterIPs | The API server (--service-cluster-ip-range) | 10.96.0.0/12 |
| Node IPs | The physical/VM hosts | Your infra (VPC/DHCP) | 192.168.0.0/24 |
The Pod CIDR and Service CIDR are virtual/internal and must not overlap with each other or with the node network. A Service’s ClusterIP is not a real interface anywhere — it is a target that kube-proxy (or eBPF) DNATs onto a real Pod IP from the Pod CIDR. Keep these three straight and 80% of “I can’t reach my Service” confusion evaporates.
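The non-overlap rule is easy to check mechanically. The sketch below uses Python's standard `ipaddress` module with the example ranges from the table; the `find_overlaps` helper and the second, deliberately broken, set of ranges are illustrative assumptions, not part of any Kubernetes tooling.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return sorted (name_a, name_b) pairs whose CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    )


# Example values taken from the table above.
cluster = {
    "pod": "10.244.0.0/16",
    "service": "10.96.0.0/12",
    "node": "192.168.0.0/24",
}

# Hypothetical misconfiguration: the node network sits inside the Service CIDR.
broken = dict(cluster, node="10.100.0.0/16")

print(find_overlaps(cluster))
print(find_overlaps(broken))
```

The first call should report no conflicting pairs; the second should flag the node and Service ranges, because 10.96.0.0/12 spans 10.96.0.0 through 10.111.255.255.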
The four communication paths
Every request in a cluster takes one of four paths, and a different mechanism handles each:
| Path | Mechanism that handles it |
|---|---|
| Container ↔ container in the same Pod | localhost — they share one network namespace and one IP |
| Pod ↔ Pod (same or different node) | The flat CNI network — direct, by Pod IP, no NAT |
| Pod ↔ Service | kube-proxy / eBPF DNATs the ClusterIP to a backing Pod IP; the name is resolved by CoreDNS |
| External ↔ Service | NodePort / LoadBalancer / Ingress → node → kube-proxy/eBPF → Pod |
The CNI owns rows 1–2 (the flat network itself). kube-proxy or eBPF owns rows 3–4 (turning virtual Service IPs into real Pod IPs). Holding these two responsibilities apart is the key to debugging: “Pod can’t reach Pod” is a CNI problem; “Pod can’t reach a Service VIP but can reach the Pod directly” is a kube-proxy/eBPF problem.
What the CNI is — and is not
CNI stands for Container Network Interface. It is a specification (a CNCF project) plus a set of reference plugins, and it is intentionally tiny. CNI defines exactly one thing: how a container runtime asks a network plugin to attach (or detach) a container to (from) a network. That is its entire scope.
What CNI is not:
- It is not a daemon or a controller. The core CNI contract is executables on disk that the runtime invokes like ordinary programs, passing JSON on stdin and getting JSON on stdout. (Modern CNIs like Cilium and Calico also run a long-lived agent DaemonSet for policy, IPAM coordination and eBPF management, but the CNI invocation itself is still a short-lived binary call.)
- It does not implement Services, DNS, or Ingress. Those are Kubernetes-layer concerns handled by kube-proxy/eBPF and CoreDNS.
- It does not define how to satisfy the network model — overlay vs BGP vs eBPF is entirely the plugin’s choice. CNI only defines the handshake.
Two files/locations matter on every node:
| Location | Contents | Default path |
|---|---|---|
| CNI config dir | The network configuration: a .conflist (or legacy .conf) JSON file describing which plugin(s) to run and with what settings | /etc/cni/net.d/ |
| CNI bin dir | The plugin executables themselves (e.g. cilium-cni, calico, bridge, host-local, portmap, loopback) | /opt/cni/bin/ |
When you “install a CNI”, you are dropping a config file into /etc/cni/net.d/ and binaries into /opt/cni/bin/ (the CNI’s DaemonSet does this for you via host mounts). If /etc/cni/net.d/ is empty, the kubelet reports the node as NotReady with NetworkPluginNotReady/cni plugin not initialized — a classic “fresh cluster, nodes never go Ready” symptom.
The CNI operations (the contract)
A CNI plugin must implement a handful of operations, selected via the CNI_COMMAND environment variable when the runtime executes the binary:
| Operation | When the runtime calls it | What the plugin does |
|---|---|---|
| ADD | A new sandbox (Pod) is being set up | Create the Pod’s interface, allocate and assign an IP (via IPAM), set up routes; return the resulting Result (IPs, routes, DNS, interfaces) as JSON |
| DEL | The sandbox is being torn down | Release the IP back to IPAM and clean up the interface/routes |
| CHECK | The runtime wants to verify a previously-added config is still in effect | Validate the interface/IP/routes still match; error if drifted |
| GC | Garbage-collect leaked resources (newer spec) | Reclaim IPs/interfaces for sandboxes that no longer exist |
| VERSION | Capability negotiation | Report supported CNI spec versions |
Alongside CNI_COMMAND, the runtime passes the request as environment variables plus JSON on stdin:
- CNI_CONTAINERID: the sandbox ID.
- CNI_NETNS: the path to the Pod’s network namespace (e.g. /var/run/netns/cni-… or /proc/<pid>/ns/net). This is the heart of it: the plugin’s job is to set up networking inside this namespace.
- CNI_IFNAME: the interface name to create inside the namespace (almost always eth0 for the Pod).
- CNI_PATH: where to find other plugin binaries (for IPAM/chaining).
- CNI_ARGS: extra key/value args; Kubernetes passes the Pod name and namespace here (K8S_POD_NAMESPACE, K8S_POD_NAME, K8S_POD_INFRA_CONTAINER_ID), which is how the CNI plugin knows which Pod it is wiring (and can look up its labels for policy/IPAM decisions).
- stdin: the network configuration JSON (the .conflist, with the current plugin’s block selected).
The plugin returns a Result on stdout: the assigned IPs (and which interface they belong to), routes, DNS info, and the interfaces created (including MACs). The runtime stores this and reports it up to the kubelet.
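To make the contract tangible, here is a toy dispatcher written as a plain Python function rather than a real executable. It is a sketch under loud assumptions: it touches no namespace and calls no IPAM, the returned address is hard-coded, and the `handle` and `parse_cni_args` names, the `toy` plugin type and the `/var/run/netns/cni-demo` path are all hypothetical. The shape of the returned Result follows the CNI 1.0.0 format as I understand it.

```python
import json

SUPPORTED = ["0.4.0", "1.0.0"]


def parse_cni_args(raw):
    """CNI_ARGS is a ';'-separated list of KEY=VALUE pairs."""
    return dict(item.split("=", 1) for item in raw.split(";") if "=" in item)


def handle(env, stdin_text):
    """Dispatch on CNI_COMMAND and return the JSON-able reply (or None)."""
    command = env["CNI_COMMAND"]
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": SUPPORTED}
    conf = json.loads(stdin_text)
    if command == "ADD":
        # A real plugin would create the interface inside env["CNI_NETNS"] and
        # ask its IPAM plugin for an address; these values are stand-ins.
        return {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1", "interface": 0}],
            "routes": [{"dst": "0.0.0.0/0"}],
            "dns": {},
        }
    if command in ("DEL", "CHECK", "GC"):
        return None  # success is signalled by exit code 0, with no Result
    raise ValueError(f"unknown CNI_COMMAND: {command}")


env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123",
    "CNI_NETNS": "/var/run/netns/cni-demo",
    "CNI_IFNAME": "eth0",
    "CNI_ARGS": "K8S_POD_NAMESPACE=default;K8S_POD_NAME=web-0",
}
conf = '{"cniVersion": "1.0.0", "name": "k8s-pod-network", "type": "toy"}'

print(parse_cni_args(env["CNI_ARGS"]))
print(json.dumps(handle(env, conf), indent=2))
```

The `interface: 0` field in each `ips` entry is an index into the `interfaces` list, which is how the Result ties an address to the interface it was assigned to.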
Config: conflists and chained plugins
The configuration in /etc/cni/net.d/ is usually a .conflist — an ordered list of plugins that are chained. CNI runs them in order on ADD (and in reverse on DEL), threading the Result of each as the prevResult input to the next. This is how responsibilities are composed. A very common chain:
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "plugins": [
    {
      "type": "calico", // main plugin: creates the veth, sets up the datapath
      "ipam": { "type": "calico-ipam" },
      "policy": { "type": "k8s" },
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
    },
    { "type": "bandwidth", "capabilities": { "bandwidth": true } }, // traffic shaping
    { "type": "portmap", "capabilities": { "portMappings": true } } // hostPort support
  ]
}
Key fields: cniVersion (the spec version), name (the network name), and the plugins array. Each plugin has a type (the binary name in /opt/cni/bin/) and a plugin-specific config. Note the nested ipam block — IPAM is itself a (sub-)plugin the main plugin delegates to. The capabilities map lets the runtime pass dynamic, per-Pod data (like a Pod’s hostPort mappings or a bandwidth annotation) into the chain. The two utility plugins above are extremely common: portmap implements hostPort (the only legitimate use of NAT-style port publishing in K8s, for the rare Pod that needs a host port), and bandwidth implements the kubernetes.io/ingress-bandwidth / egress-bandwidth annotations via Linux tc.
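The ordering rule for chained plugins (forward on ADD, reverse on DEL, with each Result threaded into the next plugin) can be modelled in a few lines. This is a simulation only: `invoke` is a hypothetical stand-in for executing a binary from /opt/cni/bin/, and the `appliedBy` field is invented here purely to make the threading visible.

```python
def invoke(plugin_type, command, prev_result):
    """Stand-in for running one plugin binary; real plugins do real work."""
    if command != "ADD":
        return None
    # The first plugin has no prevResult and produces the initial Result.
    result = dict(prev_result or {"ips": ["10.244.1.5/24"], "appliedBy": []})
    result["appliedBy"] = result["appliedBy"] + [plugin_type]
    return result


def run_chain(conflist, command):
    """Return (invocation order, final Result) for one ADD or DEL."""
    plugins = conflist["plugins"]
    ordered = plugins if command == "ADD" else plugins[::-1]
    calls, result = [], None
    for plugin in ordered:
        calls.append(plugin["type"])
        result = invoke(plugin["type"], command, result)
    return calls, result


conflist = {
    "cniVersion": "1.0.0",
    "name": "k8s-pod-network",
    "plugins": [{"type": "calico"}, {"type": "bandwidth"}, {"type": "portmap"}],
}

print(run_chain(conflist, "ADD"))
print(run_chain(conflist, "DEL"))
```

Reversing the order on DEL means the utility plugins undo their changes before the main plugin removes the interface they depend on.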
How a Pod actually gets wired: kubelet → CRI → CNI
This is the call path interviewers love and most people get fuzzy on. Here is the full chain, step by step, for a Pod being scheduled onto a node:
- The kubelet sees a Pod assigned to its node (via the API server watch) and decides to start it. The kubelet does not call CNI directly. It talks to the container runtime over the CRI (Container Runtime Interface) gRPC API.
- CRI: create the sandbox. The kubelet calls RunPodSandbox on the CRI runtime (containerd or CRI-O). The runtime creates the pause container (also called the sandbox or infra container): a tiny, do-nothing container whose only job is to hold the Pod’s network namespace (and IPC/UTS namespaces) open. Every real container in the Pod will later join this namespace, which is why all containers in a Pod share one IP. The pause container is what keeps the namespace alive even while app containers restart.
- Runtime → CNI: ADD. With the sandbox’s network namespace created, the container runtime (not the kubelet) invokes the CNI plugin with CNI_COMMAND=ADD, passing CNI_NETNS = the sandbox’s netns path, CNI_IFNAME=eth0, the Pod’s identity in CNI_ARGS, and the .conflist on stdin. (containerd does this via its built-in CNI library; CRI-O similarly. This is the bridge between “Kubernetes-land” and “CNI-land”.)
- The CNI plugin does the real work inside the netns:
  - IPAM is consulted first: the plugin (or its ipam sub-plugin) allocates a free Pod IP from this node’s slice of the Pod CIDR.
  - A veth pair is created: a virtual Ethernet cable with two ends. One end (eth0) is moved into the Pod’s network namespace and configured with the allocated IP, a default route, and DNS. The other end stays in the host namespace, attached to the datapath (a Linux bridge, or directly to routing, or to an eBPF program, depending on the plugin).
  - Routes are programmed so that the flat-network rules hold: the host knows how to reach this Pod IP, and the Pod knows how to reach everything else (typically a default route via the host end of the veth).
- CNI returns the Result (the assigned IP, routes, DNS) to the runtime, which returns the sandbox status to the kubelet. The Pod now has its IP; this is the IP you see in kubectl get pod -o wide.
- CRI: create the app containers. The kubelet now calls CreateContainer/StartContainer for each container in the Pod. Crucially, each app container is created to join the sandbox’s existing network namespace, so they all share eth0 and the one Pod IP, and talk to each other over localhost.
- On Pod deletion, the reverse happens: the kubelet tears down containers, then the runtime calls CNI with CNI_COMMAND=DEL, which releases the IP back to IPAM and removes the veth/routes, and finally the sandbox is removed.
The mental model to lock in: kubelet speaks CRI to the runtime; the runtime speaks CNI to the plugin; the plugin wires the namespace. The kubelet and CNI never talk directly. (One nuance: the kubelet still needs to know whether the node’s network is ready, which is the NotReady check in the previous section. With a CRI runtime it learns this from the runtime’s status report, because the runtime is what loads the CNI config dir; the actual ADD/DEL invocation likewise goes through the runtime.)
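The wiring step of ADD can be written down as the equivalent iproute2 commands. The sketch below only builds the command strings and never runs them, so it is safe to execute anywhere. It assumes a bridge-style plugin, a named network namespace, and a bridge (`cni0` here) that already exists and owns the gateway address; every interface and namespace name is hypothetical. Real plugins typically make netlink calls directly rather than shelling out to `ip`.

```python
def wiring_plan(netns, pod_ip, prefix_len, gateway,
                bridge="cni0", host_if="veth-demo0", tmp_if="veth-demo1"):
    """Return iproute2 commands that wire one Pod, without executing them."""
    return [
        # 1. Create the veth pair in the host namespace.
        f"ip link add {host_if} type veth peer name {tmp_if}",
        # 2. Move one end into the Pod's namespace and rename it to eth0.
        f"ip link set {tmp_if} netns {netns}",
        f"ip -n {netns} link set {tmp_if} name eth0",
        # 3. Configure the Pod end with the address that IPAM allocated.
        f"ip -n {netns} addr add {pod_ip}/{prefix_len} dev eth0",
        f"ip -n {netns} link set eth0 up",
        f"ip -n {netns} route add default via {gateway}",
        # 4. Attach the host end to the bridge that holds the gateway address.
        f"ip link set {host_if} master {bridge}",
        f"ip link set {host_if} up",
    ]


plan = wiring_plan("cni-demo", "10.244.1.5", 24, "10.244.1.1")
for command in plan:
    print(command)
```

The temporary peer name matters: creating the peer directly as `eth0` in the host namespace would collide with the host's own `eth0`, so the rename happens only after the move.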
IPAM: who hands out the Pod IPs
IPAM (IP Address Management) is the sub-system that allocates and frees Pod IPs. It is invoked as part of CNI ADD (allocate) and DEL (free). There are two broad strategies, and which one your CNI uses has real operational consequences:
| IPAM model | How it works | Pros | Cons / gotchas |
|---|---|---|---|
| host-local (per-node ranges) | Each node owns a fixed slice of the Pod CIDR (e.g. a /24). The host-local plugin tracks allocations in files under /var/lib/cni/networks/<net>/ on that node. Used by Flannel, Calico’s host-local mode, kubenet | Dead simple, no coordination, very fast | Fixed per-node capacity (a /24 = ~254 Pods/node, full or not); IPs can leak on ungraceful node crashes, leaving stale files that exhaust the range until cleaned; rebalancing is hard |
| Cluster-wide / CRD-backed | A central allocator (often the CNI’s own controller using CRDs like Calico’s IPAMBlock/IPPool, or Cilium’s per-node CIDR with cluster coordination) hands out blocks/IPs dynamically | Efficient use of address space, can borrow across nodes, supports multiple pools, IP pinning | More moving parts; depends on the CNI’s datastore/agent being healthy |
| Cloud / VPC-native | The node’s IPs come from the cloud’s address space; Pods get real VPC IPs attached to ENIs/secondary IPs (AWS VPC CNI, Azure CNI, GKE) | Pods are first-class in the VPC (security groups, no overlay), great cloud integration | Bounded by ENI/IP-per-instance limits (a real planning constraint on AWS); IP exhaustion at the VPC level |
Two failure modes you must recognise:
- IP exhaustion. If a node’s slice fills (too many Pods, or leaked IPs), new Pods on that node get stuck in ContainerCreating with a CNI ADD error like failed to allocate for range 0: no IP addresses available in range. The fix is a bigger per-node CIDR, fewer Pods/node, or cleaning leaked allocations.
- maxPods vs CIDR mismatch. The kubelet’s --max-pods (default 110) must fit inside the per-node Pod CIDR. A /24 (254 usable) comfortably holds 110; a /25 (126) is tight; a /26 (62) will refuse Pods long before maxPods.
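The capacity arithmetic behind the second failure mode can be reproduced with the standard `ipaddress` module. This is a rough sketch: `usable_pod_ips` and `fits` are illustrative helpers, the CIDRs are example values, and the count subtracts only the network and broadcast addresses. A real allocator usually reserves a gateway address as well, so actual capacity is slightly lower.

```python
import ipaddress


def usable_pod_ips(node_cidr):
    """Addresses left after removing the network and broadcast addresses."""
    return ipaddress.ip_network(node_cidr).num_addresses - 2


def fits(node_cidr, max_pods=110):
    """True if the per-node range can hold max_pods Pods at once."""
    return usable_pod_ips(node_cidr) >= max_pods


for cidr in ("10.244.1.0/24", "10.244.1.0/25", "10.244.1.0/26"):
    print(cidr, usable_pod_ips(cidr), fits(cidr))
```

A /25 technically fits 110 Pods but leaves little headroom for leaked allocations or for Pods that are terminating while their replacements start.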
The node’s Pod CIDR is visible on the Node object (kubectl get node <n> -o jsonpath='{.spec.podCIDR}') when the controller-manager’s --allocate-node-cidrs is on (the common setup with Flannel/Calico host-local). Cloud and cluster-wide CNIs may not populate it.
The plugin families: overlay, L3/BGP, and eBPF
All CNIs satisfy the same model, but they make the flat network real in fundamentally different ways. There are three families.
Family 1 — Overlay (encapsulation): VXLAN / Geneve / IP-in-IP
An overlay builds a virtual L2/L3 network on top of the existing node network by encapsulating each Pod packet inside another packet addressed node-to-node. The classic example is Flannel in VXLAN mode (Calico also offers VXLAN and IP-in-IP).
How it works: when a Pod on node A sends to a Pod on node B, the source packet (src = Pod-A-IP, dst = Pod-B-IP) is wrapped in an outer UDP/VXLAN header addressed src = node-A-IP, dst = node-B-IP. It travels across the physical network as ordinary node-to-node UDP, is decapsulated on node B, and the inner packet is delivered to the Pod. The underlying network never needs to know about Pod IPs at all — it only ever sees node-to-node traffic.
- Pros: Works on any network, including ones you don’t control (you can’t add routes to the cloud’s routers, or your switches don’t speak BGP). Zero requirements on the underlay beyond node-to-node IP connectivity. Simplest to get running anywhere.
- Cons: Encapsulation overhead — every packet carries ~50 bytes of extra headers, which eats into the MTU (you typically drop the Pod MTU to ~1450 to avoid fragmentation; a misconfigured MTU is a top cause of “large responses hang” bugs). Encap/decap costs CPU. The real Pod IPs are invisible to the physical network (harder to inspect/firewall at the infra level).
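The "~50 bytes" figure is the sum of the headers that VXLAN wraps around each inner packet. The sketch below works it out for an IPv4 underlay with no IP options; the function names are illustrative. Geneve has the same 8-byte base header as VXLAN but may add variable-length options, so treat the VXLAN number as a lower bound on Geneve's overhead.

```python
OUTER_IPV4 = 20      # outer IP header, no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # the Pod's Ethernet frame header rides inside the tunnel


def vxlan_pod_mtu(underlay_mtu):
    """Largest inner IP packet that fits in one underlay packet with VXLAN."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


def ipip_pod_mtu(underlay_mtu):
    """IP-in-IP adds only one extra IPv4 header."""
    return underlay_mtu - OUTER_IPV4


print(vxlan_pod_mtu(1500))
print(ipip_pod_mtu(1500))
```

On a standard 1500-byte underlay this gives the 1450 quoted above for VXLAN, and 1480 for IP-in-IP.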
Family 2 — L3 routing / BGP (no encapsulation)
The routed approach makes Pod IPs real routable addresses on the underlay — no wrapping. The canonical example is Calico in BGP mode. Each node runs a BGP speaker that advertises “this node owns Pod CIDR 10.244.X.0/24” to its peers (other nodes and/or the physical routers / ToR switches). Now the network itself knows how to route a Pod IP to the right node, and packets travel unencapsulated — src = Pod IP, dst = Pod IP, the whole way.
- Pros: No encapsulation overhead — full MTU, near-native throughput, real Pod IPs on the wire (inspectable, firewall-able at the infra). Scales beautifully and integrates with existing L3 fabrics.
- Cons: The underlay must cooperate: either the nodes are L2-adjacent (so node-to-node routing “just works”), or your routers must accept BGP peering. On many cloud VPCs you can’t peer BGP with the fabric, so pure-BGP needs either bird-to-bird full mesh / route reflectors and L2 adjacency, or you fall back to overlay/IP-in-IP for cross-subnet hops. More network expertise to run well.
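Once each node has advertised its Pod CIDR, forwarding is an ordinary longest-prefix-match route lookup. The following sketch models the routing table a node might end up with; the node addresses and the `next_hop` helper are hypothetical, and a real kernel uses a far more efficient structure than a linear scan.

```python
import ipaddress


def next_hop(routes, dst_ip):
    """Longest-prefix match over (cidr, next_hop) pairs; None if no route."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(cidr), hop)
        for cidr, hop in routes
        if dst in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Routes as seen on a hypothetical node A (192.168.0.11).
routes_on_node_a = [
    ("10.244.1.0/24", "local"),         # this node's own Pod CIDR
    ("10.244.2.0/24", "192.168.0.12"),  # learned via BGP from node B
    ("10.244.3.0/24", "192.168.0.13"),  # learned via BGP from node C
    ("0.0.0.0/0", "192.168.0.1"),       # default route to the gateway
]

print(next_hop(routes_on_node_a, "10.244.2.7"))
print(next_hop(routes_on_node_a, "8.8.8.8"))
```

The packet to 10.244.2.7 leaves node A unencapsulated with node B as its next hop, which only works if node B is directly reachable at L2 or the routers in between carry the same route.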
Family 3 — eBPF (programmable kernel datapath)
eBPF lets you load small, verified programs into the Linux kernel that run at hook points (the network driver XDP/tc ingress/egress, socket operations, etc.) to make packet decisions in the kernel, without iptables. Cilium is the flagship eBPF CNI; Calico also has an eBPF dataplane. eBPF is a datapath technology, orthogonal to the encap question — Cilium can run as an overlay (VXLAN/Geneve) or in direct-routing/native mode — but what defines the family is how it makes decisions: programmable maps and kernel hooks instead of long iptables chains.
- Pros: Replaces kube-proxy entirely (Service load-balancing as eBPF hash-map lookups — see below), enforces NetworkPolicy (including L7/HTTP, DNS-aware, and identity-based) efficiently in the kernel, and gives deep observability (Hubble) with flow-level visibility. Avoids the O(n) iptables blow-up at scale. Can do XDP at the driver for line-rate DDoS drop and load-balancing.
- Cons: Needs a modern kernel (the newer the eBPF features — full kube-proxy replacement, bandwidth manager, big TCP — the newer the kernel you want). A steeper mental model and newer tooling. The power is real but so is the learning curve.
The comparison that matters
| Dimension | Overlay (Flannel VXLAN) | L3/BGP (Calico) | eBPF (Cilium) |
|---|---|---|---|
| Encapsulation | Yes (VXLAN/Geneve) | No (native routing) | Optional (native or overlay) |
| Datapath | Linux bridge + kernel routing | Kernel routing + iptables | eBPF programs (tc/XDP) |
| MTU impact | ~50 B overhead, lower Pod MTU | Full MTU | Full MTU (native) |
| Throughput / CPU | Encap cost | Near-native | Near-native, often best |
| Underlay requirement | None (any IP network) | BGP peering or L2 adjacency | None (overlay) or routing (native) |
| kube-proxy | Still needed | Still needed (or eBPF mode) | Replaced by eBPF |
| NetworkPolicy | Needs a policy-capable CNI (Flannel alone = none) | Yes (iptables/ipset) | Yes, incl. L7 & DNS-aware |
| Observability | Basic | Good (felix, flow logs) | Excellent (Hubble) |
| Best for | “Just make it work anywhere” | Performance on a cooperative L3 fabric | Scale, security, observability |
A crucial gotcha that bites beginners: Flannel on its own does not implement NetworkPolicy — it only provides connectivity. If you kubectl apply a NetworkPolicy on a Flannel-only cluster, it is silently a no-op (nothing enforces it). You need a policy-capable CNI (Calico, Cilium) or the Calico-for-policy + Flannel-for-networking combo (“Canal”). “My NetworkPolicy isn’t blocking anything” is, nine times out of ten, “my CNI doesn’t enforce policy.”
NetworkPolicy enforcement: where the firewall actually lives
A NetworkPolicy is a Kubernetes object describing allowed ingress/egress for a set of Pods (by label selector). But Kubernetes does not enforce it — it has no datapath. The CNI plugin enforces it, which is why an enforcing CNI is mandatory for policies to do anything. How it is enforced depends on the plugin:
| Enforcement engine | How a policy becomes a packet decision | Scale behaviour |
|---|---|---|
| iptables + ipset (Calico Felix, kube-router, etc.) | The agent watches Pods + policies and programs iptables rules keyed on ipsets (kernel hash sets of Pod IPs) so rules don’t grow per-IP. A packet to/from a selected Pod traverses these chains; no matching allow → drop | Good with ipsets; without them, rule count explodes |
| eBPF maps (Cilium) | The agent assigns each set of identical-label Pods a numeric security identity, stores policy as eBPF map lookups keyed on (source identity → dest identity → port), and decisions happen in-kernel at the tc/XDP hook. Supports L3/L4, L7 (HTTP/gRPC/Kafka), and DNS-based rules | O(1) map lookups; scales to large clusters; far richer (L7) |
The model is default-allow until a policy selects a Pod: the instant any NetworkPolicy selects a Pod for a direction (ingress or egress), that Pod becomes default-deny for that direction and only the listed traffic is permitted. The CNI agent (Calico’s Felix, Cilium’s agent) is the controller that translates NetworkPolicy (and CRDs like CiliumNetworkPolicy/GlobalNetworkPolicy) into the kernel rules. Two consequences: policy enforcement is only as healthy as that agent DaemonSet (if it crashes, policy can go stale), and identity-based enforcement (Cilium) survives Pod IP churn better than IP-based enforcement, because identities are tied to labels, not addresses. The dedicated Cilium, eBPF & Hubble lesson goes deep on identities, L7 policy and flow observability.
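The default-allow-until-selected rule is easiest to see as a small decision function. This is a deliberately simplified model, not how any CNI implements it: it covers ingress only, a single namespace, exact-match labels, and flat port lists, and the dictionary layout is invented for the example.

```python
def selects(selector, labels):
    """True if every selector key matches; an empty selector matches all."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(dst_labels, src_labels, port, policies):
    """Decide whether traffic from src to dst on port is admitted."""
    applicable = [p for p in policies if selects(p["podSelector"], dst_labels)]
    if not applicable:
        return True  # no policy selects the Pod: default-allow
    # At least one policy selects the Pod: default-deny, allow rules are additive.
    return any(
        selects(rule["from"], src_labels) and port in rule["ports"]
        for policy in applicable
        for rule in policy["ingress"]
    )


policies = [
    {
        "podSelector": {"app": "db"},
        "ingress": [{"from": {"app": "api"}, "ports": [5432]}],
    }
]

print(ingress_allowed({"app": "web"}, {"app": "api"}, 80, policies))
print(ingress_allowed({"app": "db"}, {"app": "api"}, 5432, policies))
print(ingress_allowed({"app": "db"}, {"app": "web"}, 5432, policies))
```

The `web` Pod stays open because nothing selects it, while the `db` Pod admits only what its policy lists.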
kube-proxy versus the eBPF kube-proxy replacement
The CNI gives you Pod-to-Pod. But a Service ClusterIP is a virtual IP that exists on no interface — something must rewrite “packet to 10.96.0.10:80” into “packet to a real backing Pod IP”. Classically that something is kube-proxy, a per-node DaemonSet that watches Services + EndpointSlices and programs the kernel. It has three datapath modes:
| Mode | Mechanism | Behaviour at scale | Notes |
|---|---|---|---|
| iptables (default) | A tree of iptables chains; each Service → a chain that DNATs to a backing Pod, selecting endpoints with a statistic/probability match | Rule evaluation is roughly O(n) in Services×endpoints; thousands of Services → large rulesets, slower updates, latency on programming changes | Ubiquitous, well understood, “good enough” for most clusters |
| IPVS | Uses the kernel’s IP Virtual Server (L4 load balancer) with real hashing; supports rr, lc, dh, sh, wrr, etc. scheduling | O(1)-ish lookups, scales to many thousands of Services far better than iptables | Needs IPVS kernel modules; still uses some iptables for edge cases (masquerade, NodePort) |
| nftables (newer, GA in recent K8s) | Reimplements the proxy on nftables (the iptables successor) with set-based matching | Much better update/lookup scaling than legacy iptables | The intended long-term successor to the iptables mode |
In all kube-proxy modes the flow is the same shape: a Pod sends to the ClusterIP, the kernel on that Pod’s own node DNATs the destination to a chosen endpoint Pod IP (and SNATs/masquerades if the endpoint is on another node and externalTrafficPolicy/internal policy requires it), and the packet then rides the CNI flat network to the endpoint. kube-proxy is not in the data path as a process — it only programs the rules; the kernel does the rewriting. (This is why killing kube-proxy doesn’t instantly break existing connections — the rules persist — but it stops new Services/endpoints from being reflected.)
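The statistic/probability match used by the iptables mode deserves a worked example, because the per-rule probabilities are not equal even though the resulting distribution is. As I understand the scheme, with n endpoints the rules are tried in order with probabilities 1/n, 1/(n-1), and so on, and the last rule always matches. The sketch below verifies with exact fractions that every endpoint ends up with the same share; the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for n endpoints; the last rule always matches."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n):
    """Overall chance each endpoint is chosen when rules are tried in order."""
    shares, not_yet_matched = [], Fraction(1)
    for probability in rule_probabilities(n):
        shares.append(not_yet_matched * probability)
        not_yet_matched *= 1 - probability
    return shares


print(rule_probabilities(3))
print(effective_share(3))
```

For three endpoints the rules carry probabilities 1/3, 1/2 and 1, yet each endpoint receives exactly one third of new connections.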
The eBPF replacement
Cilium (and Calico’s eBPF mode) can disable kube-proxy entirely and implement Service load-balancing as eBPF programs with hash-map lookups attached at the socket and tc/XDP layers. Instead of walking iptables chains, a connection to a ClusterIP is resolved by an O(1) map lookup — and with socket-level load balancing, the DNAT can happen at connect() time so the packet is born addressed to the right backend, skipping per-packet NAT overhead entirely. NodePort/LoadBalancer can be handled at XDP (in the NIC driver, before the kernel network stack) for line-rate performance.
| Aspect | kube-proxy (iptables/IPVS) | eBPF replacement (Cilium) |
|---|---|---|
| Service lookup | iptables chain walk (O(n)) / IPVS hash | eBPF hash-map (O(1)) |
| Where DNAT happens | Per-packet in netfilter | At connect() (socket LB) and/or tc/XDP |
| Scale (1000s of Services) | iptables degrades; IPVS okay | Flat, excellent |
| Extra capabilities | L4 only | Integrated with policy, observability (Hubble), DSR, Maglev hashing |
| Requirement | Works on old kernels | Modern kernel for full replacement |
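The O(n) versus O(1) row can be illustrated with a toy model that counts how many entries are examined per lookup. It is an analogy only: a Python list stands in for a sequentially evaluated rule chain and a dict for a hash map keyed on the Service frontend, and all addresses are made up.

```python
def chain_walk(rules, frontend):
    """Linear scan: return (backends, rules examined)."""
    for examined, (rule_frontend, backends) in enumerate(rules, start=1):
        if rule_frontend == frontend:
            return backends, examined
    return None, len(rules)


def map_lookup(service_map, frontend):
    """One hashed lookup, independent of how many Services exist."""
    return service_map.get(frontend)


# 1000 hypothetical Services, each with a single backend.
rules = [
    ((f"10.96.{i // 256}.{i % 256}", 80), [f"10.244.1.{i % 250 + 1}"])
    for i in range(1000)
]
service_map = dict(rules)

last_frontend = rules[-1][0]
print(chain_walk(rules, last_frontend))
print(map_lookup(service_map, last_frontend))
```

Both paths return the same backend, but the scan examines all 1000 entries to reach the last Service while the map cost stays flat as the Service count grows.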
To make this concrete with a single packet: a Pod calls web.default.svc.cluster.local:80. (1) CoreDNS resolves the name to the Service’s ClusterIP. (2) The Pod sends to ClusterIP:80. (3) kube-proxy’s kernel rules or Cilium’s eBPF DNAT that to a ready backing Pod IP chosen from the EndpointSlice. (4) The packet rides the CNI flat network to that Pod’s node and into its netns. Four layers — DNS, Service-VIP translation, the CNI fabric, the Pod namespace — each owned by a different component.
Intra-cluster DNS: CoreDNS on the flat network
DNS is the last piece that makes the flat network usable by humans and apps. CoreDNS runs as a Deployment (fronted by a ClusterIP Service, conventionally 10.96.0.10, named kube-dns for backward compatibility) and is the cluster’s resolver. Every Pod’s /etc/resolv.conf is populated by the kubelet (per the Pod’s dnsPolicy, default ClusterFirst) to point at that DNS Service IP, with a search list and ndots:5:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
CoreDNS watches the API server and answers:
- A/AAAA for Services: web.default.svc.cluster.local → the Service’s ClusterIP.
- A/AAAA for headless Services (clusterIP: None): the individual Pod IPs directly (this is how StatefulSets get stable per-Pod DNS).
- SRV records for named ports, and PTR for reverse lookups.
The ndots:5 rule means any name with fewer than 5 dots is first tried with each search suffix appended before being tried as-is. That is why web resolves to web.default.svc.cluster.local from inside the default namespace, but also why a careless external lookup like api.github.com (2 dots) wastes several queries on the search list first. The point for this lesson: CoreDNS rides entirely on the flat Pod network and the Service abstraction; it is “just another Pod behind a ClusterIP”. If your CNI is broken, DNS breaks too, and the symptom (could not resolve host) often looks like a DNS problem when the root cause is the underlying network or kube-proxy. Always test connectivity by raw IP before blaming DNS.
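The query order produced by the search list and ndots can be modelled directly. This sketch follows glibc-style resolver behaviour as I understand it (other resolvers, such as musl's, differ in the details); `query_order` is an illustrative helper and the search list is the one from the resolv.conf example above.

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def query_order(name, search=SEARCH, ndots=5):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # fully qualified: the search list is skipped
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try as-is first
    return expanded + [name]      # too few dots: search list first


print(query_order("web"))
print(query_order("api.github.com"))
print(query_order("api.github.com."))
```

Adding a trailing dot (or lowering ndots through the Pod's dnsConfig) is the usual way to avoid the wasted lookups for external names.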
The diagram traces the full stack: the kubelet calling the CRI to create a pause/sandbox container, the runtime invoking the CNI plugin to allocate an IP (IPAM) and wire a veth into the Pod’s netns, the three plugin families making the flat network real (overlay encap vs BGP routes vs eBPF), and a Service request being DNAT’d by kube-proxy or eBPF onto that flat network — with CoreDNS resolving the name alongside.
Hands-on lab
This lab runs free on a local cluster. We will inspect the real CNI configuration and binaries, trace a Pod’s IP and its veth/namespace, prove the flat network, observe IPAM, and watch how a NetworkPolicy changes behaviour. We use kind, which uses a CNI under the hood and lets us shell into nodes.
Note: kind ships with a simple default CNI (kindnet). For the NetworkPolicy step we deploy Calico so policy is actually enforced, demonstrating the “Flannel/kindnet alone doesn’t enforce policy” point from above.
1. A multi-node cluster
cat <<'EOF' | kind create cluster --name cni-lab --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.244.0.0/16" # the Pod CIDR
  serviceSubnet: "10.96.0.0/16" # the Service CIDR
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
kubectl get nodes -o wide
# Note each node's INTERNAL-IP (node network) — distinct from Pod IPs.
2. See the CNI config and binaries on a node
# Shell into a node (kind nodes are containers)
docker exec -it cni-lab-worker bash
# The CNI config that the runtime reads:
ls -l /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist 2>/dev/null || cat /etc/cni/net.d/*.conf
# -> shows cniVersion, name, the plugin chain (type, ipam, etc.)
# The CNI plugin binaries the runtime invokes:
ls -l /opt/cni/bin/
# -> bridge, host-local, portmap, loopback, ... (and the CNI's own binary)
exit
You have just seen the two things that are a CNI install: a config file and some executables.
3. Deploy Pods and inspect IPs, namespaces and veths
kubectl create deployment web --image=nginx:1.27 --replicas=3
kubectl set resources deployment web --requests=cpu=50m,memory=32Mi
kubectl rollout status deployment/web
kubectl get pods -l app=web -o wide
# Each Pod has a unique IP from 10.244.0.0/16, spread across nodes.
# Show the node's Pod CIDR slice (IPAM range for that node):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
Now look at the Linux plumbing on a node that hosts a web Pod:
POD=$(kubectl get pod -l app=web -o jsonpath='{.items[0].metadata.name}')
NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
echo "$POD is on $NODE"
docker exec -it "$NODE" bash
# The host side of the veth pairs (one per Pod on this node):
ip -d link show | grep -A1 veth
# The host routes that send Pod-CIDR traffic to the right place:
ip route | grep 10.244
exit
You are seeing the veth pairs (the host end of each Pod’s virtual cable) and the routes the CNI programmed — the concrete realisation of the flat network.
4. Prove the flat, NAT-free network
# From one Pod, curl another Pod by its raw Pod IP (no Service involved):
PODA=$(kubectl get pod -l app=web -o jsonpath='{.items[0].metadata.name}')
IPB=$(kubectl get pod -l app=web -o jsonpath='{.items[1].status.podIP}')
NODEA=$(kubectl get pod "$PODA" -o jsonpath='{.spec.nodeName}')
NODEB=$(kubectl get pod -l app=web -o jsonpath='{.items[1].spec.nodeName}')
echo "Pod A on $NODEA -> Pod B IP $IPB on $NODEB"
kubectl exec "$PODA" -- curl -s -o /dev/null -w "%{http_code}\n" "http://$IPB"
# -> 200, even across nodes, addressing Pod B by its real IP. That is the flat network.
# Confirm the destination sees the *source Pod's real IP* (no NAT) — check nginx logs:
kubectl logs "$(kubectl get pod -l app=web -o jsonpath='{.items[1].metadata.name}')" | tail -1
# The client IP in the log is Pod A's Pod IP, not a node IP.
5. Watch IPAM allocate and release
# Scale up and watch new Pods get fresh IPs from the CIDR:
kubectl scale deployment web --replicas=6
kubectl get pods -l app=web -o wide -w # Ctrl-C after they are Running
# Scale down: those IPs are released back to IPAM (CNI DEL).
kubectl scale deployment web --replicas=2
6. NetworkPolicy enforcement (with a policy-capable CNI)
# A client Pod and a default-deny test:
kubectl run client --image=curlimages/curl --restart=Never -- sleep 3600
kubectl expose deployment web --port=80 # ClusterIP Service "web"
kubectl wait --for=condition=Ready pod/client
# Baseline: client CAN reach the Service (no policy yet):
kubectl exec client -- curl -s -o /dev/null -w "%{http_code}\n" http://web
# Apply a default-deny ingress policy on the web Pods:
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-deny-ingress
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes: ["Ingress"] # selects web Pods for Ingress -> default-deny ingress
EOF
# Now the client is blocked (times out) — IF the CNI enforces policy:
kubectl exec client -- curl -s --max-time 4 -o /dev/null -w "%{http_code}\n" http://web || echo "blocked (expected)"
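If you ran this on a CNI that doesn’t enforce policy (plain Flannel, or a kindnet build without policy support), the curl would still return 200, which proves the policy is a no-op without an enforcing datapath. On Calico/Cilium it times out. Newer kind releases ship a kindnet that enforces NetworkPolicy itself, so what you see on the default CNI depends on your kind version.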
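Default-deny is only half of the pattern; the usual follow-up is an explicit allow. The sketch below prints a second policy that would let the client Pod back in. It assumes the client Pod carries the label run=client, which kubectl run normally sets (confirm with kubectl get pod client --show-labels). The policy name web-allow-client and the helper allow_client_policy are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print a NetworkPolicy that allows ingress to the web
# Pods on TCP 80 from Pods labelled run=client in the same namespace.
# Assumption: `kubectl run client` labelled the Pod run=client.
allow_client_policy() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow-client   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              run: client
      ports:
        - protocol: TCP
          port: 80
EOF
}

allow_client_policy
```

Piping the output into kubectl (allow_client_policy | kubectl apply -f -) adds it alongside web-deny-ingress. NetworkPolicies are additive, so on an enforcing CNI the client’s curl should succeed again while Pods without the run=client label stay blocked. Deleting the kind cluster in Cleanup removes this extra policy as well.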
7. See the Service VIP translation
# The Service has a ClusterIP that lives on no interface:
kubectl get svc web # note the CLUSTER-IP, e.g. 10.96.x.y
# Inspect the kernel rules kube-proxy programmed for it (iptables mode):
docker exec cni-lab-worker bash -c "iptables-save -t nat | grep -i web | head"
# -> KUBE-SVC-* / KUBE-SEP-* chains that DNAT the ClusterIP to backing Pod IPs.
Validation
You should have observed: (a) a CNI config + binaries on disk; (b) unique Pod IPs from the Pod CIDR, with per-node CIDR slices; (c) veth pairs and host routes implementing the flat network; (d) cross-node Pod-to-Pod by raw IP with the real source IP preserved (no NAT); (e) IPs allocated/freed on scale; (f) a NetworkPolicy that blocks traffic only because the CNI enforces it; (g) the iptables NAT chains that turn a ClusterIP into a Pod IP.
Cleanup
kubectl delete networkpolicy web-deny-ingress
kubectl delete deployment web
kubectl delete svc web
kubectl delete pod client
kind delete cluster --name cni-lab
Cost note
Everything ran locally in Docker via kind — zero cloud cost. The only resource consumed is local CPU/RAM; deleting the kind cluster reclaims it all.
Common mistakes & troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Nodes stuck NotReady, cni plugin not initialized / NetworkPluginNotReady | No CNI installed, or its config absent from /etc/cni/net.d/ | Install a CNI (apply its manifest); confirm a .conflist and binaries appear and the CNI DaemonSet is Running |
| Pods stuck ContainerCreating with failed to allocate IP / no IP addresses available | IPAM exhaustion: node CIDR full or leaked allocations | Enlarge the per-node Pod CIDR / cluster CIDR, reduce maxPods, or clean leaked host-local files under /var/lib/cni/networks/ |
| NetworkPolicy doesn’t block anything | CNI doesn’t enforce policy (plain Flannel/kindnet), or the policy agent (Felix/Cilium) is unhealthy | Use a policy-capable CNI (Calico/Cilium/Canal); check the agent DaemonSet is Running on every node |
| Cross-node Pod traffic fails; same-node works | Overlay/routing broken: wrong MTU, blocked VXLAN UDP (8472/4789), missing BGP peering, or a node firewall/security-group dropping it | Verify CNI pods healthy on both nodes; check MTU; open the overlay port / BGP; check cloud SG/NACL allows node-to-node + Pod CIDR |
| Large responses hang, small ones work | MTU mismatch (encap overhead not accounted for); PMTU black hole | Lower the Pod MTU (e.g. 1450 for VXLAN), ensure consistent MTU end to end |
| Pod has no IP / kubectl get pod -o wide shows <none> long after scheduling | CNI ADD failing (bad config, missing binary, IPAM datastore down) | kubectl describe pod for the CNI error; check the CNI agent logs on that node; verify /opt/cni/bin has the plugin |
| Service unreachable by name but reachable by ClusterIP | CoreDNS problem (Pods crashlooping, ndots/search misconfig, wrong dnsPolicy) | kubectl -n kube-system get pods -l k8s-app=kube-dns; test with FQDN; check the Pod’s /etc/resolv.conf |
| Service unreachable by ClusterIP but Pod reachable by Pod IP | kube-proxy problem (crashed, wrong mode, missing IPVS modules) or no ready endpoints | Check kube-proxy DaemonSet; kubectl get endpointslices; verify endpoints exist and are Ready |
| Client IP shows as a node IP, not the real source | SNAT/masquerade on the path (Service externalTrafficPolicy: Cluster, or egress masquerade) | Use externalTrafficPolicy: Local where client IP matters; understand which hops SNAT |
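Several rows of this table amount to a bottom-up triage order: Pod IP first, then ClusterIP, then Service name. A minimal sketch of that decision logic follows. triage is a hypothetical helper, and its three arguments are results you record yourself (ok or fail) from curl tests against a Pod IP, the Service ClusterIP, and the Service name.

```shell
#!/bin/sh
# triage POD_IP_RESULT CLUSTER_IP_RESULT SERVICE_NAME_RESULT
# Hypothetical helper: each argument is "ok" or "fail", as observed by hand.
# It names the lowest layer that failed, because higher layers depend on it.
triage() {
  pod_ip=$1
  cluster_ip=$2
  svc_name=$3

  if [ "$pod_ip" != ok ]; then
    echo "flat network: check the CNI agent, routes, MTU and node firewalls"
  elif [ "$cluster_ip" != ok ]; then
    echo "service layer: check kube-proxy and that the Service has ready endpoints"
  elif [ "$svc_name" != ok ]; then
    echo "dns layer: check CoreDNS Pods and the Pod's /etc/resolv.conf"
  else
    echo "no fault at these three layers"
  fi
}

# Example: Pod IP and ClusterIP work, the Service name does not.
triage ok ok fail
```

The ordering is the design choice: a failed Pod-IP test makes the two tests above it meaningless, so the function reports only the lowest broken layer.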
Best practices
- Choose the CNI deliberately, up front. It is hard to swap later. Pick based on environment (can you peer BGP? what kernel?) and needs (L7 policy? observability? raw throughput?). Overlay for “anywhere”, BGP for a cooperative L3 fabric, eBPF for scale/security/observability.
- Size the Pod CIDR for growth. Plan per-node /24s (or the cloud’s IP-per-node limits) against your real maxPods. Running out of Pod IPs is a painful, cluster-wide problem to retrofit.
- Get the MTU right before you have a production incident. Overlays need a reduced Pod MTU; verify end to end. MTU bugs are silent until a large payload hangs.
- Run the CNI as a healthy, monitored DaemonSet. Connectivity and policy depend on the agent on every node. Alert on its readiness and restarts.
- Default-deny, then allow. Adopt a default-deny NetworkPolicy posture per namespace and explicitly allow required flows — but only after confirming your CNI enforces policy.
- Prefer IPVS/nftables or the eBPF replacement at scale. Legacy iptables kube-proxy degrades past a few thousand Services; choose a mode that scales for large clusters.
- Keep Pod CIDR, Service CIDR and node network non-overlapping — and document them. Overlaps cause maddening intermittent failures.
- Test connectivity by IP before blaming DNS. Isolate “is the flat network up?” from “is name resolution working?” — they fail with similar-looking symptoms.
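A quick way to sanity-check Pod CIDR sizing is plain prefix arithmetic. The sketch below assumes IPv4 and counts raw addresses per prefix; the number an IPAM plugin can actually hand out is somewhat lower, because a few addresses (network, broadcast, gateway) are typically reserved. cidr_size and node_slices are hypothetical helper names.

```shell
#!/bin/sh
# Raw number of IPv4 addresses in a /PREFIX (before any IPAM reservations).
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# How many per-node slices of /NODE_PREFIX fit into a cluster /CLUSTER_PREFIX.
# Usage: node_slices CLUSTER_PREFIX NODE_PREFIX
node_slices() {
  echo $(( 1 << ($2 - $1) ))
}

echo "addresses in a /24 node slice: $(cidr_size 24)"       # 256
echo "addresses in a /27 node slice: $(cidr_size 27)"       # 32
echo "/24 slices in a /16 Pod CIDR:  $(node_slices 16 24)"  # 256
```

Read against the kubelet’s default maxPods of 110, a /24 slice leaves headroom, while a /27 slice (32 raw addresses) would run out long before the scheduler stops placing Pods on the node.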
Security notes
- NetworkPolicy is your in-cluster firewall — but only with an enforcing CNI. A flat, NAT-free network means any Pod can reach any other Pod by default. Without policies, a single compromised Pod can talk to every database and API in the cluster. Adopt default-deny and least-privilege egress/ingress.
- Egress control matters as much as ingress. Restrict which external destinations Pods can reach (egress policies, or DNS-aware/L7 policies in Cilium) to contain data exfiltration from a compromised workload.
- Identity-based policy is more robust than IP-based in a churny cluster: tying rules to labels/identities (Cilium) rather than ephemeral Pod IPs avoids gaps when Pods reschedule. See the Cilium/eBPF/Hubble lesson.
- Encapsulation is not encryption. VXLAN wraps but does not encrypt Pod traffic — it is plaintext on the wire. For confidentiality between nodes, enable the CNI’s transparent encryption (WireGuard/IPsec in Calico/Cilium) or use a service mesh’s mTLS.
- Protect the overlay/BGP control plane. Unauthenticated BGP peering or an open VXLAN port to untrusted hosts can let an attacker inject routes or join the Pod network. Lock node-to-node traffic to known peers via firewalls/security groups.
- hostNetwork: true Pods bypass the CNI (they share the node’s namespace and IP) and thus bypass Pod-level NetworkPolicy in most CNIs. Treat host-networked Pods as privileged and minimise them.
- Watch the CNI’s own RBAC and credentials. The CNI agent typically has broad cluster access (to watch Pods/policies and write CRDs). A compromised CNI is a cluster-wide network compromise.
Interview & exam questions
1. Walk me through exactly how a Pod gets its IP — who calls whom?
The kubelet talks to the container runtime over the CRI (RunPodSandbox), which creates the pause/sandbox container holding the Pod’s network namespace. The runtime then invokes the CNI plugin with CNI_COMMAND=ADD, passing the netns path. The CNI plugin calls IPAM to allocate an IP from the node’s Pod-CIDR slice, creates a veth pair (one end eth0 inside the netns with the IP, the other in the host), programs routes, and returns the Result. The kubelet never calls CNI directly.
2. State the Kubernetes network model’s rules.
Every Pod gets a unique routable IP; every Pod can reach every other Pod on any node without NAT; every node can reach every Pod without NAT; and a Pod sees itself by the same IP others use to reach it. IP-per-Pod, flat, NAT-free.
3. What is the CNI, concretely — a daemon?
A specification plus executable plugins. The core contract is binaries in /opt/cni/bin/ that the runtime invokes (JSON in/out) per the config in /etc/cni/net.d/, implementing ADD/DEL/CHECK/GC. (Modern CNIs also run an agent DaemonSet for policy/IPAM, but the invocation itself is a short-lived binary call.)
4. Overlay vs BGP: when do you pick each?
Overlay (VXLAN/Geneve) encapsulates Pod packets node-to-node; it works on any network, at the cost of header overhead and a lower MTU. BGP/L3 routing advertises Pod CIDRs so packets travel unencapsulated at full MTU and native speed, but it requires the underlay to cooperate (BGP peering or L2 adjacency). Overlay for networks you don’t control; BGP for a cooperative L3 fabric.
5. Why is eBPF a different kind of CNI?
eBPF is a datapath technology: verified programs in kernel hooks (tc/XDP/socket) make packet decisions via maps, replacing iptables. It is orthogonal to encapsulation (Cilium can overlay or route). It enables kube-proxy replacement (O(1) Service lookups, socket-level LB), L7/identity-based NetworkPolicy, and deep observability (Hubble).
6. Who enforces NetworkPolicy?
The CNI plugin, not Kubernetes. Kubernetes has no datapath. Calico/Cilium translate policies into iptables+ipset rules or eBPF maps. With a non-enforcing CNI (plain Flannel), NetworkPolicy is a silent no-op.
7. iptables vs IPVS kube-proxy: what’s the difference and why does it matter?
Both turn a ClusterIP into a Pod IP via kernel rules. iptables mode walks chains roughly O(n) in Services×endpoints, which is fine for small clusters but slow at thousands of Services. IPVS uses the kernel’s L4 load balancer with real hashing for O(1)-ish lookups and far better scaling, plus multiple scheduling algorithms. (nftables mode is the modern successor; eBPF replaces kube-proxy entirely.)
8. What is IPAM and how does it fail?
IP Address Management allocates/frees Pod IPs during CNI ADD/DEL. host-local gives each node a fixed CIDR slice (simple, but fixed capacity and can leak on crashes); cluster-wide/CRD allocators borrow across nodes; VPC-native uses real cloud IPs (bounded by ENI limits). Failure = ContainerCreating with “no IP addresses available” when a range is exhausted.
9. What is the pause container and why does it exist?
A tiny no-op container created per Pod that holds the network (and IPC/UTS) namespace open. App containers join its namespace, so they share one IP and talk over localhost. It keeps the namespace alive across app-container restarts.
10. A Pod can reach another Pod by IP but not by Service name. Where do you look?
That isolates the problem to the Service/DNS layer, not the CNI flat network (which clearly works). Check CoreDNS (pods healthy? resolv.conf/ndots/search correct? FQDN works?) and, if even the ClusterIP fails, kube-proxy (mode, health) and whether the Service has ready endpoints.
11. Why does Kubernetes forbid NAT between Pods?
To give every workload a real, stable identity on a flat L3 network: it preserves the real source IP (for logs, allow-lists, rate limiting), avoids NAT’s protocol breakage, and removes a stateful connection-tracking bottleneck between Pods. NAT survives only at the edges (egress, some Service paths).
12. How does encapsulation affect MTU, and what’s the classic bug?
VXLAN adds ~50 bytes per packet, so the usable Pod MTU must drop (e.g. to ~1450). If it isn’t lowered (or PMTU discovery is blocked), large packets are dropped and you get the signature “small requests work, large responses hang” bug.
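The ~50-byte figure can be reconstructed from the header sizes. The sketch below assumes VXLAN over an IPv4 underlay with a 1500-byte MTU; an IPv6 underlay or a different encapsulation has a different overhead. The variable names are illustrative.

```shell
#!/bin/sh
# Assumption: IPv4 underlay with the standard 1500-byte MTU.
underlay_mtu=1500

# Bytes that VXLAN encapsulation places around each Pod packet:
outer_ipv4=20      # outer IPv4 header
outer_udp=8        # outer UDP header
vxlan_header=8     # VXLAN header
inner_ethernet=14  # the Pod's Ethernet frame header, carried inside the tunnel

overhead=$(( outer_ipv4 + outer_udp + vxlan_header + inner_ethernet ))
pod_mtu=$(( underlay_mtu - overhead ))

echo "VXLAN overhead: ${overhead} bytes"   # 50
echo "Pod MTU:        ${pod_mtu}"          # 1450
```

To probe the path by hand, one option (assuming iputils ping is available where you run it) is ping -M do -s 1422 against a Pod IP on another node: 1422 bytes of payload plus 28 bytes of ICMP and IPv4 headers makes a 1450-byte packet that is not allowed to fragment.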
Quick check
- From which range is a Pod’s IP allocated, and who allocates it?
- Which component invokes the CNI plugin — the kubelet or the container runtime?
- Name the three CNI plugin families and their defining trait.
- True or false: applying a NetworkPolicy on a plain-Flannel cluster blocks traffic.
- What does kube-proxy do to a packet destined for a ClusterIP, and is kube-proxy in the data path?
Answers
- From the Pod CIDR, by the CNI plugin via IPAM (during ADD).
- The container runtime (containerd/CRI-O), which the kubelet drives over the CRI. The kubelet does not call CNI directly.
- Overlay (encapsulation — VXLAN/Geneve, e.g. Flannel), L3/BGP (native routing, e.g. Calico), eBPF (programmable kernel datapath, e.g. Cilium).
- False — plain Flannel doesn’t enforce NetworkPolicy, so it’s a silent no-op. You need a policy-capable CNI.
- The kernel DNATs the ClusterIP to a chosen ready backing Pod IP (selected from EndpointSlices), then the packet rides the flat CNI network. kube-proxy only programs the rules; it is not in the data path as a process.
Exercise
On a kind cluster, deliberately reproduce two failure modes and explain each from first principles:
- IPAM exhaustion. Create a kind cluster with a deliberately tiny per-node Pod CIDR (e.g. set a /27 node-cidr-mask via kubeadm config, or schedule far more Pods than the slice allows onto one node with a nodeSelector). Schedule Pods until new ones hang in ContainerCreating; capture the exact CNI error from kubectl describe pod; then show the fix by reducing the Pod count.
- Policy is a no-op without enforcement. On the default kindnet cluster, apply a default-deny ingress NetworkPolicy to a web Deployment and prove (with curl from a client Pod) that traffic is still allowed. Then install Calico, re-apply, and prove it is now blocked.
Write up: for each, the symptom, the root cause in the stack (which layer/component), and the precise fix. This is exactly the reasoning a CKA/CKS scenario tests.
Certification mapping
- CKA (Certified Kubernetes Administrator): the Services & Networking domain covers the cluster network model, choosing and understanding a CNI, kube-proxy modes, CoreDNS, and NetworkPolicy at an operator level. Expect to install/diagnose a CNI, debug Pod-to-Pod and Service connectivity from first principles, and reason about the Pod/Service CIDRs — exactly this lesson’s datapath material.
- CKS (Certified Kubernetes Security Specialist): the Cluster Setup and Minimize Microservice Vulnerabilities domains lean on NetworkPolicy (default-deny, ingress/egress, namespace isolation) and understanding that the CNI enforces it. You will write and verify policies and reason about network segmentation and encryption.
- KCNA: the networking fundamentals — the Pod network model, what a CNI is at a conceptual level, Services and DNS — map to its Kubernetes-fundamentals objectives.
Glossary
- Network model — Kubernetes’ contract: IP-per-Pod, flat, NAT-free Pod-to-Pod, node-to-Pod reachability.
- CNI (Container Network Interface) — the spec + plugins that attach containers to a network; the thing that implements the model.
- .conflist / .conf — the CNI configuration in /etc/cni/net.d/ describing the plugin chain.
- CNI bin dir — /opt/cni/bin/, the plugin executables the runtime invokes.
- ADD / DEL / CHECK / GC — the CNI operations: attach / detach / verify / garbage-collect a sandbox’s networking.
- CRI (Container Runtime Interface) — the gRPC API the kubelet uses to drive the container runtime (containerd/CRI-O).
- Pause / sandbox / infra container — the tiny container that holds a Pod’s network namespace open so all containers share one IP.
- Network namespace (netns) — the Linux isolation unit for a network stack; each Pod has its own.
- veth pair — a virtual Ethernet cable with two ends; one end (eth0) goes inside the Pod’s netns, the other into the host datapath.
- IPAM (IP Address Management) — the allocator that hands out/frees Pod IPs (host-local, cluster-wide/CRD, or VPC-native).
- Pod CIDR — the range Pod IPs come from (often sliced per node).
- Service CIDR — the range Service ClusterIPs come from (allocated by the API server).
- Overlay — making the Pod network real by encapsulating packets node-to-node (VXLAN/Geneve/IP-in-IP).
- VXLAN / Geneve — UDP-based encapsulation protocols used by overlays; add header overhead, reduce MTU.
- L3 routing / BGP — making Pod IPs natively routable by advertising Pod CIDRs; no encapsulation.
- eBPF — verified in-kernel programs at network hooks (tc/XDP/socket) for the datapath; basis of Cilium.
- XDP (eXpress Data Path) — an eBPF hook in the NIC driver for line-rate packet processing.
- NetworkPolicy — the Kubernetes object describing allowed Pod ingress/egress; enforced by the CNI, not Kubernetes.
- Felix — Calico’s per-node agent that programs the datapath/policy (iptables+ipset).
- Security identity — Cilium’s numeric identity per label-set, used for identity-based policy independent of Pod IP.
- kube-proxy — the per-node agent that programs kernel rules to turn ClusterIPs into Pod IPs (iptables/IPVS/nftables).
- IPVS — the kernel’s L4 load balancer; a kube-proxy mode that scales better than iptables.
- kube-proxy replacement — implementing Service load-balancing in eBPF instead of kube-proxy (Cilium).
- DNAT / SNAT — destination / source network address translation; DNAT turns a ClusterIP into a Pod IP.
- MTU — maximum transmission unit; must account for encapsulation overhead on overlays.
- CoreDNS — the cluster DNS server resolving Service/Pod names; runs on the flat network behind a ClusterIP.
- ndots / search domains — resolver settings that append cluster suffixes to short names (ndots:5).
- hostNetwork — a Pod sharing the node’s network namespace/IP, bypassing the CNI and most Pod-level policy.
Next steps
You can now reason about every packet in a cluster from the wire up: the model it must satisfy, the CNI handshake that wires each Pod, how IPs are allocated, how the three plugin families make the flat network real, and how Services and policy are enforced in the datapath. Next, learn how to see what your cluster is doing — metrics, dashboards and alerting: Kubernetes Monitoring, In Depth: metrics-server, Prometheus, Grafana & Alerting. For the deepest dive on the most powerful datapath — eBPF identities, L7/DNS-aware policy and flow observability — go straight to Cilium, eBPF, NetworkPolicy & Hubble Observability. And to see the top-down view of the same Service/DNS machinery this lesson explained from the bottom, revisit Kubernetes Services & Networking.