Kubernetes Troubleshooting Playbooks: Pods, Nodes, Networking, Storage & RBAC

The difference between an engineer who has run Kubernetes for a year and one who is still nervous around it is rarely knowledge of obscure features. It is a method. When a pod is stuck and a manager is asking for an ETA, the strong engineer does not guess — they run a short, fixed sequence of commands that narrows the problem to one layer, form a hypothesis, prove it, fix it, and verify. Everyone else restarts things and hopes.

This lesson gives you that method and then turns it into playbooks: for each common failure you will get a table of symptom → likely cause → the diagnostic command that confirms it → the fix. We cover the failures you genuinely hit in production and on the CKA/CKS exams — pods that crash-loop, fail to pull images, get OOM-killed, sit Pending or get Evicted; nodes that go NotReady; Services with no endpoints, DNS that fails, NetworkPolicies that silently drop traffic, and Ingress controllers returning 502; storage that never binds or never mounts; and the RBAC Forbidden and admission-webhook denials that block deploys. By the end you will diagnose any of these on a free local cluster without guessing.

Learning objectives

By the end of this lesson you can:

Apply a repeatable troubleshooting loop — observe, isolate the layer, compare desired vs actual, hypothesise, fix, verify, prevent — to any Kubernetes failure.
Drive the four core diagnostic commands (kubectl get, describe, logs, and events) fluently, and know exactly which one answers which question.
Diagnose every common Pod failure: CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, and Evicted.
Triage a NotReady node and the node-pressure conditions (MemoryPressure, DiskPressure, PIDPressure) behind it.
Debug Service/DNS/NetworkPolicy/Ingress problems — “no endpoints”, CoreDNS failures, default-deny drops, and Ingress 502s.
Fix storage failures — a Pending PVC, mount/attach errors, and access-mode mismatches — and RBAC/admission denials.

Prerequisites & where this fits

You need a working kubectl, a cluster you can break (a free local one is ideal — kind, minikube, or k3d), and the core object model from earlier in the course: Pods, ReplicaSets, Deployments & Services and the control-plane/node architecture. It also helps to have met RBAC, NetworkPolicies, and CSI storage, though we re-explain each failure from first principles. This is the Troubleshooting lesson of the Kubernetes Zero-to-Hero course: the bridge between knowing the objects and operating them under pressure. The next lesson steps up to cluster-level incidents and root-cause analysis. Everything here runs on free, local tooling — no cloud account, no charges.

The method: a loop, not a guess

Almost every Kubernetes problem yields to the same six-step loop. The discipline is to follow it in order rather than jumping to a fix you have used before.

1. Observe. Look at reality before forming any opinion: kubectl get for the high-level state, then kubectl describe for the detail and Events, then kubectl logs for what the app itself said. Read the cluster’s own event stream with kubectl get events --sort-by=.lastTimestamp. Resist the urge to change anything yet.
2. Isolate the layer. Kubernetes is layered: scheduling → image/runtime → application → networking → storage → policy/auth. Decide which layer the symptom lives in. A pod that never gets a node is a scheduling problem; a pod that runs but a Service can’t reach it is a networking problem. Naming the layer eliminates most of the search space.
3. Compare desired vs actual. Kubernetes is declarative: every controller is trying to make actual match desired. Most bugs are a gap between the two, such as a Service selector that doesn’t match the pod labels, a replica count that the scheduler can’t satisfy, or an image tag that doesn’t exist. Put the spec next to the live state and look for the mismatch.
4. Form one hypothesis. State it as a falsifiable sentence: “the Service has no endpoints because its selector is app=web but the pods are labelled app=frontend.” A vague hunch (“networking is broken”) cannot be tested; a specific claim can.
5. Fix. Make the smallest change that tests the hypothesis. Edit the manifest and re-apply (kubectl apply -f), not a manual hot-patch you’ll forget. If the fix proves the hypothesis, you’ve also found the root cause. If it doesn’t, you’ve cheaply eliminated one possibility, so go back to step 4.
6. Verify, then prevent. Confirm the symptom is gone, not just that “it looks better”: kubectl get pods all Running/Ready, kubectl get endpoints populated, kubectl rollout status complete. Then ask the prevention question: what guardrail (a probe, a resource request, a default-deny policy with the right allow-rule, a CI check) stops this recurring?
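
The observe step can be sketched as one read-only shell function. This is a minimal sketch, not a standard tool: the function name observe_pod and the default namespace are assumptions made for illustration, while the kubectl subcommands and flags are the ones named in the steps above.

```shell
# observe_pod: run the read-only "observe" commands for one pod, in order.
# Usage: observe_pod <pod> [namespace]   (namespace defaults to "default")
# The function name is hypothetical; nothing here changes cluster state.
observe_pod() {
  echo "== high-level state =="
  kubectl get pod "$1" -n "${2:-default}" -o wide
  echo "== detail and Events =="
  kubectl describe pod "$1" -n "${2:-default}"
  echo "== what the app said (previous container if there is one, else current) =="
  kubectl logs "$1" -n "${2:-default}" --previous --tail=50 2>/dev/null \
    || kubectl logs "$1" -n "${2:-default}" --tail=50
  echo "== recent cluster events =="
  kubectl get events -n "${2:-default}" --sort-by=.lastTimestamp
}
```

Call it as observe_pod <pod> <namespace> and read the output top to bottom before touching anything; for a multi-container pod you would likely add -c <container> to the logs calls.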

[Figure: Kubernetes troubleshooting decision tree]

The decision tree above encodes step 2: start from the symptom at the top, branch on “is the pod scheduled / running / reachable?”, and each leaf points you at the playbook below. When you are under pressure, walking the tree keeps you honest about which layer you are in before you touch anything.

The four commands that answer everything

You can solve the large majority of problems with four commands. Know precisely what each one tells you — this is what stops the flailing.

Command	Question it answers	Where to look
`kubectl get <resource> -o wide`	What is the high-level state? Which node, which IP, how many ready?	`STATUS`, `RESTARTS`, `READY`, `NODE` columns
`kubectl describe <resource>`	Why is it in that state?	The Events section at the bottom — it states reasons verbatim
`kubectl logs <pod> [-c <container>] [--previous]`	What did the application itself say?	App stdout/stderr; `--previous` shows the last crashed container
`kubectl get events --sort-by=.lastTimestamp`	What has the cluster been doing recently, across objects?	Chronological reasons: scheduling, pulls, probe failures, evictions

A few high-leverage habits: alias kubectl to k and turn on completion; use -o wide by default so you always see the node and IP; reach for --previous the instant you see RESTARTS > 0; and remember that both describe and kubectl get events read Events from the API, where they expire (after about an hour by default), so older history is only available if a logging stack or event exporter captured it. For deeper inspection, kubectl get <res> -o yaml shows the full live object including .status, and kubectl debug (ephemeral containers, stable since v1.25) lets you attach a toolbox container to a running or distroless pod without rebuilding the image.
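
The two deeper-inspection commands can be wrapped as small helpers. This is a sketch under stated assumptions: the function names debug_attach and live_status are hypothetical, busybox:1.36 is just one possible toolbox image, and --target only shares the container's process namespace if the container runtime supports it.

```shell
# debug_attach: attach a throwaway busybox shell to a running pod as an
# ephemeral container. Usage: debug_attach <pod> <container> [namespace]
# (hypothetical helper name; the toolbox image is an assumption)
debug_attach() {
  kubectl debug -it "$1" -n "${3:-default}" \
    --image=busybox:1.36 \
    --target="$2"
}

# live_status: print only the .status block of the live pod object.
# Usage: live_status <pod> [namespace]
live_status() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.status}'
}
```

For example, debug_attach web-abc123 nginx would open a shell next to the nginx container of that pod without rebuilding its image.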

Pods: the workload-level playbook

Most incidents are pod-level, and the STATUS column plus kubectl describe Events almost always name the cause. The mental split is scheduling vs runtime vs application: Pending is scheduling (the pod has no node yet); ImagePullBackOff is runtime (the node can’t get the image); CrashLoopBackOff and OOMKilled are application/resource (the container started and then died); Evicted is node pressure pushing a pod off after it ran.
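
The mental split above is small enough to write down as a lookup. The sketch below is illustrative only: layer_for_status is a hypothetical helper, and it encodes nothing beyond the mapping stated in the paragraph.

```shell
# layer_for_status: map a pod STATUS to the layer it usually points at.
# Hypothetical helper; it mirrors the scheduling / runtime / application split.
layer_for_status() {
  case "$1" in
    Pending)                       echo "scheduling" ;;
    ImagePullBackOff|ErrImagePull) echo "runtime" ;;
    CrashLoopBackOff|OOMKilled)    echo "application/resource" ;;
    Evicted)                       echo "node-pressure" ;;
    *)                             echo "unknown: read the describe Events" ;;
  esac
}
```

For instance, layer_for_status ImagePullBackOff prints runtime, which tells you to read the pull error in the pod's Events rather than the application logs.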

Symptom (`STATUS` / signal)	Likely cause	Diagnostic command	Fix
`CrashLoopBackOff`	App exits non-zero on start (bad config, missing env/secret, failed migration, wrong command), or a too-aggressive liveness probe kills it before it’s ready	`kubectl logs <pod> --previous`; `kubectl describe pod <pod>` (exit code, `Last State: Terminated`, probe events)	Fix the app error the logs show (supply the env/Secret/ConfigMap, correct the `command`/`args`); if it’s the probe, fix the probe target/port or add a `startupProbe` so slow starts aren’t killed
`ImagePullBackOff` / `ErrImagePull`	Image name/tag typo, tag doesn’t exist, private registry without `imagePullSecrets`, registry rate-limited, or wrong architecture	`kubectl describe pod <pod>` — Events show the exact pull error (`manifest unknown`, `unauthorized`, `429 Too Many Requests`)	Correct the image/tag; add an `imagePullSecret` (`kubectl create secret docker-registry …`) and reference it on the pod/ServiceAccount; for rate limits, authenticate or use a pull-through cache
`OOMKilled` (in `Last State`, exit code 137)	Container exceeded its memory limit (or the node ran out and the kernel OOM-killer chose it)	`kubectl describe pod <pod>` → `Last State: Terminated, Reason: OOMKilled`; `kubectl top pod <pod>`	Raise the memory limit (and set a matching request); fix the app’s leak/heap; right-size with VPA recommendations — don’t just keep bumping the limit blindly
`Pending`	Scheduler can’t place it: insufficient CPU/memory on any node, an untolerated taint, unsatisfiable node/pod affinity or topology spread, or an unbound PVC	`kubectl describe pod <pod>` — Events: “Insufficient cpu/memory”, “had untolerated taint”, “didn’t match node selector”, “had volume node affinity conflict”	Lower the request, add a toleration/the right `nodeSelector`/label, add capacity (or let Cluster Autoscaler), or fix the PVC (see Storage). `kubectl get nodes -o wide` and `kubectl describe node` show allocatable headroom
`Pending` with “0/N nodes available” and no obvious resource line	No schedulable nodes (all cordoned/NotReady), or pod requests a resource (GPU, specific zone) nothing offers	`kubectl get nodes`; `kubectl describe pod` Events	Uncordon/repair nodes (see Nodes); relax the constraint or provision matching capacity
`Evicted` (pod text shows “The node was low on resource: …”)	Node pressure — `DiskPressure`/`MemoryPressure` triggered the kubelet to evict lower-priority/over-request pods	`kubectl get pod <pod> -o yaml` → `status.message`; `kubectl describe node <node>` (Conditions)	Clear node pressure (free disk/images, fix the noisy pod); set proper requests and a sensible PriorityClass so critical pods aren’t evicted first; delete the dead `Evicted` objects (`kubectl delete pod --field-selector status.phase=Failed`)
`Init:Error` / `Init:CrashLoopBackOff`	An init container is failing (waiting on a dependency, bad migration, missing config)	`kubectl logs <pod> -c <init-container>`; `kubectl describe pod` shows which init step is stuck	Fix that init container’s command/config or the dependency it waits for; init containers run to completion in order before app containers start
`ContainerCreating` (stuck)	Volume not attaching/mounting, missing Secret/ConfigMap referenced as a volume, CNI not assigning an IP, or image still pulling	`kubectl describe pod <pod>` — Events name it (`FailedMount`, `FailedCreatePodSandBox`, secret “not found”)	Create the missing Secret/ConfigMap; resolve the volume issue (see Storage); check CNI health on the node
`Completed` but expected to keep running	Container’s main process exited 0 (script finished, wrong `command`, ran as a one-shot)	`kubectl logs <pod>`; check `spec.containers[].command`	Make it a long-running process or use a Job/CronJob if it’s genuinely one-shot
Pod `Running` but not `Ready` (`0/1`)	Readiness probe failing — app up but not serving, dependency down, wrong probe path/port	`kubectl describe pod <pod>` (Readiness probe events); `kubectl logs <pod>`	Fix the probe target or the dependency; a failing readiness probe (correctly) keeps the pod out of Service endpoints — that is the link to “no endpoints” below
`Terminating` (stuck for minutes)	A finalizer is blocking deletion, or the node is gone and the pod is orphaned	`kubectl get pod <pod> -o yaml` → `metadata.finalizers`, `deletionTimestamp`	Resolve/remove the finalizer’s owner; for a dead node, `kubectl delete pod --grace-period=0 --force` (last resort)

Two reflexes worth burning in. First, --previous for any crash: CrashLoopBackOff means the container already died and restarted, so the current container has no useful logs — kubectl logs <pod> --previous shows the failed run. Second, exit codes are a shortcut: 137 is SIGKILL (usually OOM or a liveness kill), 143 is SIGTERM (graceful shutdown), 1/2 are generic app errors, and 126/127 mean the command isn’t executable / not found.
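
The exit-code shortcut can be sketched as a small decoder. The function name explain_exit_code is hypothetical; the one rule it adds to the list above is the usual shell convention that a code above 128 is 128 plus the signal number.

```shell
# explain_exit_code: translate a container exit code into a likely meaning.
# Hypothetical helper; codes above 128 follow the 128 + signal convention.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (signal 9): usually OOM or a liveness kill" ;;
    143) echo "SIGTERM (signal 15): graceful shutdown" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error"
      fi
      ;;
  esac
}
```

The code itself can be read from the live object with kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}' and passed straight to the function.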

Nodes: when the machine, not the pod, is the problem

When many pods on one node misbehave at once — evictions, ContainerCreating, sudden NotReady pods — suspect the node, not the workloads. A node has a small set of Conditions the kubelet reports, and reading them is the whole game.

Symptom	Likely cause	Diagnostic command	Fix
Node `NotReady`	kubelet stopped/crashed, lost API-server connectivity, CNI not ready, or the host is down	`kubectl describe node <node>` (Conditions, last heartbeat); on the host: `systemctl status kubelet`, `journalctl -u kubelet`	Restart/repair the kubelet; fix networking to the control plane; reinstall/repair the CNI; if hardware, drain and replace
`MemoryPressure = True`	Node out of allocatable memory; kubelet starts evicting	`kubectl describe node <node>`; `kubectl top node`	Reduce load (evict/reschedule), set proper requests, add nodes; find the offender with `kubectl top pod -A --sort-by=memory`
`DiskPressure = True`	Disk/inode exhaustion — often image bloat or container logs filling the disk	`kubectl describe node <node>`; on host `df -h`, `crictl images`	kubelet garbage-collects images/containers automatically; if not enough, prune images, rotate logs, grow the disk, move `/var/lib/containerd` to bigger storage
`PIDPressure = True`	Too many processes/threads on the node (a fork bomb or a leaky workload)	`kubectl describe node <node>`	Find and fix the offending pod; set pod PID limits; add capacity
Pods `Evicted` en masse on one node	The pressure conditions above triggered kubelet eviction by priority/over-request	`kubectl get pods -A -o wide \| grep Evicted`; node Conditions	Resolve the pressure; set requests + PriorityClass so critical pods survive
Node `Ready` but `SchedulingDisabled`	The node was cordoned (often left over from a drain/upgrade)	`kubectl get nodes` (shows `SchedulingDisabled`)	`kubectl uncordon <node>` once it’s healthy
New pods avoid a node	A taint (e.g. `node.kubernetes.io/disk-pressure`, or a custom one) repels pods without the matching toleration	`kubectl describe node <node>` (Taints)	Remove the taint if stale, or add the toleration to pods that should run there
Whole node’s pods unreachable	Node-level network/CNI failure, or the node is partitioned from the cluster network	`kubectl get nodes -o wide`; check CNI DaemonSet pods on that node (`kubectl get pods -n kube-system -o wide`)	Restart/repair the CNI agent on the node; check the underlying network/security groups

A clean repair sequence for a node you need to take out of rotation safely: cordon → drain → fix → uncordon. kubectl cordon <node> stops new pods landing; kubectl drain <node> --ignore-daemonsets --delete-emptydir-data evicts the existing ones respecting PodDisruptionBudgets; you fix or reboot the host; then kubectl uncordon <node> returns it to service. kubectl drain also marks the node unschedulable itself, so the explicit cordon matters most while you are still investigating: without it, new pods keep landing on a sick node before you have decided to drain.
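
The sequence can be sketched as a pair of shell functions. The names begin_maintenance and end_maintenance and the 300-second drain timeout are assumptions for illustration; the kubectl flags are the ones from the paragraph above plus the standard --timeout option.

```shell
# begin_maintenance: cordon, then drain, stopping if the cordon fails.
# Usage: begin_maintenance <node>   (hypothetical helper name)
begin_maintenance() {
  kubectl cordon "$1" \
    && kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# end_maintenance: return the node to service and show its status.
# Usage: end_maintenance <node>
end_maintenance() {
  kubectl uncordon "$1" && kubectl get node "$1"
}
```

Run begin_maintenance <node>, repair or reboot the host, then run end_maintenance <node> and confirm the node reports Ready without SchedulingDisabled.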

Networking: Services, DNS, NetworkPolicy & Ingress

Networking is where troubleshooting gets subtle, because there are several independent layers — Service-to-pod selection, cluster DNS, NetworkPolicy, and (north-south) Ingress — and a failure in any one looks like “it can’t connect”. Isolate the layer first.

Symptom	Likely cause	Diagnostic command	Fix
Service has no endpoints (connections refused/time out)	Service selector doesn’t match any pod labels, or matching pods aren’t Ready (failing readiness), or `targetPort` is wrong	`kubectl get endpointslices -l kubernetes.io/service-name=<svc>` (or `kubectl get endpoints <svc>`) shows empty; compare `kubectl describe svc <svc>` Selector with `kubectl get pods --show-labels`	Align the selector to the pod labels (or vice versa); fix readiness so pods become Ready; correct `targetPort` to the container’s actual port
Endpoints exist but still can’t connect	`targetPort`/`containerPort` mismatch, app listening on `127.0.0.1` not `0.0.0.0`, or a NetworkPolicy dropping it	`kubectl exec <client-pod> -- curl -sS <svc>:<port>`; `kubectl debug` a netshoot pod; check the app’s bind address	Fix the port mapping; make the app listen on `0.0.0.0`; check policies (below)
DNS resolution fails (`Name or service not known`)	CoreDNS down/overloaded, wrong `dnsPolicy`, or a NetworkPolicy blocking egress to kube-dns on port 53	`kubectl run dnsutils -it --rm --restart=Never --image=registry.k8s.io/e2e-test-images/agnhost:2.45 --command -- nslookup kubernetes.default`; `kubectl get pods -n kube-system -l k8s-app=kube-dns`; `kubectl logs -n kube-system -l k8s-app=kube-dns`	Restore CoreDNS (scale/resources/restart); allow egress to kube-dns (UDP+TCP 53) in your NetworkPolicy; verify `/etc/resolv.conf` search domains
Cross-namespace name doesn’t resolve	Using a short name across namespaces	`nslookup <svc>.<ns>.svc.cluster.local` from a client pod	Use the FQDN `<svc>.<namespace>.svc.cluster.local` (short names only resolve within the same namespace)
NetworkPolicy silently drops traffic	A default-deny policy is in effect with no matching allow rule; CNI doesn’t enforce policy; or egress (incl. DNS) not allowed	`kubectl get networkpolicy -A`; `kubectl describe networkpolicy <np>`; test with/without the policy (`kubectl exec … curl`)	Add an explicit ingress/egress allow rule (remember DNS egress on 53); confirm your CNI enforces NetworkPolicy (flannel doesn’t; Calico/Cilium do)
Policy “applied” but has no effect	CNI plugin doesn’t support NetworkPolicy	`kubectl get pods -n kube-system` (which CNI?); CNI docs	Switch to/confirm a policy-enforcing CNI (Calico, Cilium) — the API object is accepted even if nothing enforces it
Ingress 502 Bad Gateway / 503 Service Unavailable	Backend Service has no endpoints or wrong port, app not Ready, controller can’t reach pods, or backend protocol mismatch (HTTP vs HTTPS)	`kubectl get endpointslices` for the backend Service; `kubectl logs -n <ns> <ingress-controller-pod>`; `kubectl describe ingress <ing>`	Fix the backend (endpoints/port/readiness as above; a 502 or 503 is usually the Service behind it, not the Ingress); set the backend-protocol annotation if the app speaks HTTPS
Ingress 404 / wrong backend	Host/path rules don’t match, missing `ingressClassName`, or no default backend	`kubectl describe ingress <ing>` (rules); controller logs	Correct host/path rules; set `ingressClassName`; ensure the controller watches that class
External traffic never arrives (`LoadBalancer` Service `<pending>`)	No cloud LB provisioner (bare-metal/local), or quota/permission issue	`kubectl describe svc <svc>` (Events)	Install MetalLB (bare-metal) or use NodePort/port-forward locally; check cloud quotas/IAM in cloud

The single most common networking bug, by a wide margin, is “no endpoints” — and it is always one of three things: the selector doesn’t match the labels, the pods aren’t Ready, or the port is wrong. Make kubectl get endpointslices (or kubectl get endpoints <svc>) your first move on any “can’t connect” report; an empty endpoint list instantly tells you the problem is selection/readiness, while a populated list pushes you toward policy or ports.
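
That first move can be sketched as one function that prints the three things you need to compare. The name svc_check is hypothetical, and the sketch assumes the Service and its pods live in the same namespace.

```shell
# svc_check: show a Service's endpoints, its selector, and the pod labels.
# Usage: svc_check <service> [namespace]   (hypothetical helper name)
svc_check() {
  echo "== endpoints behind the Service =="
  kubectl get endpointslices -n "${2:-default}" -l "kubernetes.io/service-name=$1"
  echo "== Service selector =="
  kubectl get svc "$1" -n "${2:-default}" -o jsonpath='{.spec.selector}{"\n"}'
  echo "== pod labels and readiness =="
  kubectl get pods -n "${2:-default}" --show-labels
}
```

If the first section is empty, compare the selector in the second section with the labels and READY column in the third; if it is populated, move on to ports and policy.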

Storage: PVCs, mounts & access modes

Storage failures cluster at two moments: binding (a PersistentVolumeClaim that never gets a PersistentVolume) and mounting (a bound PVC that won’t attach to the node). The split tells you where to look — kubectl get pvc for binding, the pod’s Events for mounting.

Symptom	Likely cause	Diagnostic command	Fix
PVC stuck `Pending`	No default StorageClass and none named; no matching PV for static provisioning; requested size/access mode unavailable	`kubectl describe pvc <pvc>` (Events: “no persistent volumes available”, “no storage class”); `kubectl get storageclass`	Set/name a StorageClass (`storageClassName`); mark one default (`storageclass.kubernetes.io/is-default-class: "true"`); for static PVs, create a PV that matches size + access mode
PVC `Pending` with `WaitForFirstConsumer`	StorageClass uses volume binding mode `WaitForFirstConsumer` — binds only once a pod schedules	`kubectl describe pvc <pvc>` (normal “waiting for first consumer”)	This is expected; create/schedule the pod that uses it and it binds (avoids zone mismatch)
Pod stuck `ContainerCreating` with `FailedMount`	Volume can’t attach/mount: wrong zone (PV in a different AZ than the node), CSI driver issue, or fs problem	`kubectl describe pod <pod>` (Events: `FailedAttachVolume`, `FailedMount`, “volume node affinity conflict”)	Match pod scheduling to the volume’s zone (topology); check the CSI driver pods (`kubectl get pods -n kube-system`); for zone conflict use `WaitForFirstConsumer`
`Multi-Attach error` (event reason `FailedAttachVolume`)	A `ReadWriteOnce` volume is being attached to a second node (e.g. rolling update before the old pod detaches)	`kubectl describe pod <pod>` (“Multi-Attach error for volume”)	Use `Recreate` strategy for RWO volumes, or a `ReadWriteMany`-capable storage backend if you truly need multi-node access; wait for the old pod to fully terminate
App can’t write (read-only filesystem)	Wrong access mode, volume mounted `readOnly: true`, or `fsGroup`/permissions wrong	`kubectl get pvc <pvc>` (access mode); pod `volumeMounts` (`readOnly`); `kubectl exec … ls -ld <path>`	Use the right access mode (RWO/ROX/RWX), drop `readOnly`, set `securityContext.fsGroup` so the app’s user owns the mount
Two pods can’t share a volume	Storage class only supports RWO, not RWX	`kubectl get pvc` (access modes); StorageClass/provisioner capability	Use an RWX-capable backend (NFS/CephFS/Azure Files); most block storage is RWO only
PVC won’t delete / stuck `Terminating`	A pod still uses it, or a finalizer (`kubernetes.io/pvc-protection`) holds it	`kubectl describe pvc <pvc>` (finalizers, “still being used by pod”)	Delete the consuming pod first; the protection finalizer releases once nothing mounts it
Data gone after pod restart	Used `emptyDir` (ephemeral) or a `Deployment` instead of a `StatefulSet` for stateful data	Inspect `volumes` (emptyDir vs PVC)	Use a PVC (and a StatefulSet with `volumeClaimTemplates`) for data that must survive restarts/rescheduling

The reflex: kubectl get pvc first. Bound means binding succeeded and any remaining failure is at mount time (look at the pod’s Events). Pending means binding failed and you look at StorageClasses and describe pvc. That one branch saves you from debugging mounts when the volume never even bound.
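
The branch can be sketched as a function that reads the claim's phase and points you at the right layer. The name pvc_triage is hypothetical; the phase is read from .status.phase of the live PVC.

```shell
# pvc_triage: branch on the PVC phase to pick binding vs mount debugging.
# Usage: pvc_triage <pvc> [namespace]   (hypothetical helper name)
pvc_triage() {
  phase=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}')
  case "$phase" in
    Bound)
      echo "Bound: binding worked; read the consuming pod's Events for mount errors"
      ;;
    Pending)
      echo "Pending: binding has not happened; checking the claim and classes"
      kubectl describe pvc "$1" -n "${2:-default}"
      kubectl get storageclass
      ;;
    *)
      echo "phase is '${phase:-unknown}': inspect with kubectl describe pvc $1"
      ;;
  esac
}
```

Running pvc_triage <pvc> before opening any pod Events keeps you from debugging a mount for a volume that never bound.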

RBAC & admission: when the API server says no

These failures are different in flavour: the cluster is working, but it is refusing you. The two sources are authorization (RBAC says you lack permission) and admission control (a validating/mutating webhook or a policy engine rejects the object). Crucially, RBAC is purely additive: there are no deny rules, so an RBAC Forbidden always means a permission was never granted, not that something explicitly blocked you.

Symptom	Likely cause	Diagnostic command	Fix
`Error … is forbidden: User "X" cannot <verb> <resource>`	No Role/ClusterRole grants that subject the verb+resource (+namespace) — permission simply not granted	`kubectl auth can-i <verb> <resource> -n <ns> --as <user>` (or `--as system:serviceaccount:<ns>:<sa>`) returns `no`	Create/extend a Role/ClusterRole with the verb+resource and bind it (RoleBinding/ClusterRoleBinding) to the subject
A pod’s app gets 403 from the Kubernetes API	The pod’s ServiceAccount lacks RBAC (or you assumed `default` SA has rights — it has almost none)	`kubectl auth can-i <verb> <resource> --as system:serviceaccount:<ns>:<sa>`; check `serviceAccountName` on the pod	Bind a Role to that ServiceAccount; set `serviceAccountName` explicitly; grant least privilege, not `cluster-admin`
`Forbidden` only in one namespace	You bound a Role (namespaced) where you needed cluster-wide, or bound in the wrong namespace	`kubectl get rolebinding,clusterrolebinding -A -o wide \| grep <subject>`	Use a ClusterRole + ClusterRoleBinding for cluster-wide, or add a RoleBinding in the missing namespace
Can read but not modify (get works, create/delete forbidden)	Role grants only read verbs (`get`/`list`/`watch`), missing `create`/`update`/`patch`/`delete`	`kubectl auth can-i --list -n <ns> --as <subject>` (lists everything the subject can do)	Add the write verbs to the Role
`admission webhook "…" denied the request`	A validating webhook/policy engine (Kyverno, Gatekeeper/OPA) rejected the object for violating a policy	The error message names the webhook and reason; `kubectl get validatingwebhookconfigurations`; engine logs/policy reports	Make the manifest comply (add the required label/securityContext/etc.), or fix/exempt the policy if it’s wrong
Everything fails with `Internal error … webhook … connection refused/timeout`	A webhook’s backing Service/pod is down and its `failurePolicy: Fail` blocks all matching API calls	`kubectl get pods -n <webhook-ns>`; `kubectl describe validatingwebhookconfiguration <name>`	Restore the webhook’s pod/Service; as a break-glass, scope or temporarily remove the webhook config (it can wedge the whole cluster)
Object created but mutated unexpectedly (extra labels/sidecar)	A mutating webhook (e.g. a mesh injector, Kyverno mutate) changed it on admission	`kubectl get mutatingwebhookconfigurations`; compare submitted vs stored object	Expected behaviour if intended; otherwise adjust/exempt the mutating policy
Pod rejected: `violates PodSecurity "restricted"`	Pod Security Admission is enforcing a standard the pod doesn’t meet	`kubectl get ns <ns> --show-labels` (`pod-security.kubernetes.io/enforce=…`); error names the violation	Add the required `securityContext` (drop capabilities, `runAsNonRoot`, seccomp), or relax the namespace label if appropriate

The one command to internalise here is kubectl auth can-i. With --as it impersonates a user or ServiceAccount and asks the API server whether an action is allowed — so you can reproduce a Forbidden decision exactly, for the precise subject and namespace, without redeploying anything. kubectl auth can-i --list --as system:serviceaccount:<ns>:<sa> dumps everything that subject can do, which makes “what’s missing?” obvious.
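
Once can-i has shown what is missing, the grant itself is usually a small Role plus RoleBinding. The sketch below writes one to a file; every name in it (namespace demo, ServiceAccount app-sa, Role pod-reader, the file name) is a hypothetical placeholder, and the read-only verbs are only an example of a minimal grant.

```shell
# Write a least-privilege Role and RoleBinding for one ServiceAccount.
# All names below are hypothetical placeholders for illustration.
cat > pod-reader-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: demo
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-sa-pod-reader
  namespace: demo
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
EOF
```

After kubectl create namespace demo and kubectl apply -f pod-reader-rbac.yaml, kubectl auth can-i list pods -n demo --as system:serviceaccount:demo:app-sa should answer yes, while the same check for delete pods should still answer no.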

Hands-on lab: break it, diagnose it, fix it

You’ll plant several classic faults on a free local cluster, then walk each through the loop. Everything here is local — no cloud account, no charges.

1. Create a cluster (free / local):

kind create cluster --name ts-lab
kubectl config use-context kind-ts-lab
kubectl get nodes          # Expect: one node, STATUS Ready

2. Fault A — ImagePullBackOff. Deploy a bad image tag:

kubectl create deployment web --image=nginx:doesnotexist
kubectl get pods                       # STATUS: ImagePullBackOff / ErrImagePull
kubectl describe pod -l app=web | sed -n '/Events/,$p'   # Events name the pull error

Fix it by setting a real tag, then verify:

kubectl set image deployment/web nginx=nginx:1.27
kubectl rollout status deployment/web                    # successfully rolled out
kubectl get pods                                         # Running / 1/1 Ready

3. Fault B — Service with no endpoints. Create a Service whose selector is wrong:

kubectl expose deployment web --port=80 --selector=app=frontend   # mismatched on purpose
kubectl get endpoints web              # <none>  ← the tell
kubectl describe svc web | grep Selector
kubectl get pods --show-labels         # pods are app=web, not app=frontend

Fix by aligning the selector and confirm endpoints appear:

kubectl patch svc web -p '{"spec":{"selector":{"app":"web"}}}'
kubectl get endpoints web              # now lists a pod IP:80
kubectl run probe --rm -it --image=busybox:1.36 --restart=Never -- wget -qO- web   # serves HTML

4. Fault C — CrashLoopBackOff. Run a container that exits immediately:

kubectl run crasher --image=busybox:1.36 -- /bin/sh -c "echo starting; exit 1"
kubectl get pod crasher                # CrashLoopBackOff, RESTARTS climbing
kubectl logs crasher --previous        # 'starting' — the last crashed run
kubectl describe pod crasher | grep -A2 "Last State"   # Terminated, Exit Code: 1

The fix is to correct the command so the process stays up (here, run a long-lived process); the lesson is the --previous reflex.

5. Fault D — Pending (unschedulable). Request more CPU than the node has:

kubectl run hog --image=nginx:1.27 --overrides='{"spec":{"containers":[{"name":"hog","image":"nginx:1.27","resources":{"requests":{"cpu":"64"}}}]}}'
kubectl get pod hog                     # Pending
kubectl describe pod hog | grep -A3 Events   # "Insufficient cpu"

Fix by lowering the request (kubectl delete pod hog then recreate with cpu: 100m) and watch it schedule.

6. Fault E — PVC Pending (no StorageClass match). Request an impossible class:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: stuck }
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: nonexistent
  resources: { requests: { storage: 1Gi } }
EOF
kubectl get pvc stuck                   # Pending
kubectl describe pvc stuck | grep -A3 Events    # no storage class / no volumes
kubectl get storageclass                # kind ships 'standard' (default)

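Fix by recreating the claim with the real default class. A PVC’s storageClassName cannot be changed once it is set, so re-applying the edited manifest is rejected: run kubectl delete pvc stuck, then apply the manifest again with storageClassName: standard (or omit the field). On kind the standard class uses WaitForFirstConsumer, so the new claim still shows Pending, with a “waiting for first consumer” event, until a pod mounts it; at that point it binds.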
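
A corrected claim can be sketched together with a pod that mounts it, on the assumption that kind's standard class binds only once a consumer pod is scheduled. The pod name stuck-consumer, the file name, and the mount path are hypothetical.

```shell
# Write the corrected PVC plus a pod that consumes it.
# Pod name, file name, and mount path are hypothetical placeholders.
cat > stuck-fixed.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stuck
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: stuck-consumer
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo ok > /data/probe && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: stuck
EOF
```

Delete the old claim with kubectl delete pvc stuck, apply the file with kubectl apply -f stuck-fixed.yaml, and watch kubectl get pvc stuck move to Bound once the pod is scheduled.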

Validation. A clean run shows: each fault reproduced with the exact STATUS/Event the playbook predicts; each diagnosis made from describe/logs/endpoints (not guessing); and each fix verified (Running/Ready, endpoints populated, Bound).

Cleanup (so nothing is left running):

kind delete cluster --name ts-lab

Cost note: free / local. kind runs the entire cluster in Docker on your laptop — no cloud account, no charges.

Common mistakes & troubleshooting

A meta-table — the mistakes engineers make while troubleshooting, which keep them stuck:

Mistake	Why it bites	Do this instead
Reading `logs` without `--previous` on a crash-loop	The current container is fresh; the failed run’s logs are gone	`kubectl logs <pod> --previous` whenever `RESTARTS > 0`
Ignoring the Events section of `describe`	Kubernetes literally tells you the reason there	Always scroll to Events first; add `kubectl get events --sort-by=.lastTimestamp`
Restarting/deleting pods before diagnosing	Destroys the evidence (and may “fix” it without you learning the cause)	Diagnose first; reproduce in the lab if you must
Assuming the `default` ServiceAccount has permissions	It has almost none; in-cluster API calls 403	Bind a least-privilege Role to an explicit SA
Forgetting DNS egress in a default-deny NetworkPolicy	Everything “can’t resolve”; looks like DNS is broken	Always add egress to kube-dns on UDP+TCP 53
Blaming the Ingress for a 502	A 502 (or 503) usually means the backend is the problem: the Service has no endpoints, the pods aren’t Ready, or the port is wrong	Check `kubectl get endpointslices` for the backend first
Bumping memory limits to “fix” OOMKilled forever	Masks a leak; wastes capacity	Set request+limit, then fix the app / right-size with VPA
Debugging a mount when the PVC never bound	Wrong layer entirely	`kubectl get pvc` first: `Pending` = binding, `Bound` = mount

Best practices

Make failures legible before they happen. Set resource requests and limits so the scheduler and OOM behaviour are predictable; add readiness, liveness, and startup probes so “up” means “serving” and slow starts aren’t killed.
Standardise the loop. Teach the team the six steps and the four commands; put a one-page runbook (this lesson’s tables) in your repo so on-call doesn’t improvise at 3 a.m.
Keep events. API Events expire in ~1 hour — ship them and pod logs to a logging stack so you can investigate after the pod is gone. Add kube-state-metrics and alert on CrashLoopBackOff, Pending age, and node Conditions.
Use kubectl debug (ephemeral containers) to attach a toolbox to distroless/running pods instead of baking debug tools into production images.
Default-deny, then allow. Run NetworkPolicies default-deny with explicit allows (including DNS) so connectivity is intentional and “no route” is a known state, not a mystery.
Right-size from data. Use VPA recommendations (even in recommend-only mode) to set requests/limits, so you stop the OOM-bump cycle.
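
The default-deny practice above can be sketched as two policies written to a file: one that denies everything in a namespace and one that re-allows DNS. The namespace demo, the policy names, and the file name are hypothetical, and the sketch assumes CoreDNS runs in kube-system with the k8s-app=kube-dns label, as it does on most clusters.

```shell
# Write a default-deny policy plus an explicit DNS egress allowance.
# Namespace, policy names, and file name are hypothetical placeholders.
cat > default-deny-with-dns.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: demo
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: demo
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
```

Apply it with kubectl apply -f default-deny-with-dns.yaml on a cluster whose CNI enforces NetworkPolicy, then add one allow rule per intended connection; name resolution keeps working while everything else stays denied until you say otherwise.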

Security notes

Troubleshooting touches the cluster’s trust boundaries, so do it safely. kubectl exec/debug into a pod is privileged access to that workload — gate it with RBAC and audit it; on the CKS exam and in production, the ability to exec is a real escalation path. Be careful with --grace-period=0 --force deletes: they remove the API object even if the pod still runs on a partitioned node, which can cause split-brain for stateful apps — use it only when you’ve confirmed the node is truly gone. Treat admission webhooks as cluster-critical: a webhook with failurePolicy: Fail whose backend is down can wedge all matching API operations, so monitor webhook health and know the break-glass (scope or remove the webhook config). When you debug RBAC, resist the temptation to grant cluster-admin to make the error go away — reproduce with kubectl auth can-i --as and grant the minimum verb/resource. And never paste Secrets into logs or tickets while debugging (kubectl get secret -o yaml is base64, not encryption). These themes are developed in RBAC least-privilege design, Pod Security Admission, and default-deny NetworkPolicies.

Interview & exam questions

A pod is in CrashLoopBackOff. Walk me through your steps. kubectl get pods to see RESTARTS; kubectl describe pod for Events and the Last State exit code; kubectl logs <pod> --previous for the failed run’s output. Decide: is it the app exiting non-zero (fix config/command/env) or a liveness probe killing it (fix the probe / add a startupProbe)? Fix the manifest, re-apply, kubectl rollout status.
What’s the difference between ImagePullBackOff and CrashLoopBackOff? ImagePullBackOff is a runtime/pull problem — the node never got the image (bad tag, auth, rate limit); the container never started. CrashLoopBackOff means the image pulled and the container started then exited repeatedly — an application/config problem. describe Events distinguish them immediately.
A Service has no endpoints. What’s the single most likely cause and the one command that confirms it? The selector doesn’t match any Ready pods (label mismatch or failing readiness). Confirm with kubectl get endpoints <svc> (empty), then compare kubectl describe svc Selector against kubectl get pods --show-labels.
An app gets 403 Forbidden from the Kubernetes API. RBAC has no deny rules — so what does that error actually mean, and how do you reproduce it? Since RBAC is purely additive, Forbidden means no binding ever granted that subject the verb/resource/namespace. Reproduce with kubectl auth can-i <verb> <resource> -n <ns> --as system:serviceaccount:<ns>:<sa>; fix by binding a least-privilege Role to that ServiceAccount.
A pod is Pending. Which one command tells you why, and where do you read it? kubectl describe pod <pod>: the FailedScheduling event in the Events section states the reason, for example “Insufficient cpu”, “node(s) had untolerated taint”, “node(s) didn’t match Pod’s node affinity/selector”, or “node(s) had volume node affinity conflict”.
OOMKilled — what happened and what’s the exit code? The container exceeded its memory limit (or the node OOM-killer chose it); exit code 137 (SIGKILL), shown in Last State: Terminated, Reason: OOMKilled. Fix by setting an adequate memory request+limit and addressing the app’s memory use — not by bumping the limit forever.
Your Ingress returns 502. Where do you look first? At the backend Service’s endpoints (kubectl get endpointslices), because a 502 almost always means the controller has no healthy backend — pods not Ready, wrong targetPort, or empty selector. Only after that do you suspect the Ingress rules, class, or backend protocol.
DNS lookups fail inside pods. How do you debug it? Run a test pod and nslookup kubernetes.default; check CoreDNS (kubectl get pods -n kube-system -l k8s-app=kube-dns and its logs). A frequent cause is a default-deny NetworkPolicy with no DNS egress — add egress to kube-dns on UDP+TCP 53.
A PVC is stuck Pending. What are the usual causes? No default/named StorageClass, no matching static PV (size/access mode), or WaitForFirstConsumer (expected until a pod schedules). kubectl describe pvc Events and kubectl get storageclass tell you which.
You see Multi-Attach error on a pod’s volume during a rolling update. Why? A ReadWriteOnce volume can attach to only one node; a RollingUpdate tried to start the new pod (new node) before the old pod detached. Use the Recreate strategy for RWO volumes, or an RWX storage class if multi-node access is genuinely required.
What does kubectl debug give you that kubectl exec doesn’t? Ephemeral containers — you attach a toolbox image to a running pod (including distroless images with no shell) or create a debug copy, without rebuilding or restarting the workload. Ideal for production pods that ship without debug tools.
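How do you safely take a node out of service to repair it, without disrupting workloads? kubectl cordon (stop new pods) → kubectl drain --ignore-daemonsets --delete-emptydir-data (evict existing, respecting PodDisruptionBudgets) → fix/reboot → kubectl uncordon. kubectl drain cordons the node itself before evicting, so the separate cordon matters most when you want to stop new pods landing on a sick node while you are still investigating. Forgetting uncordon afterwards leaves the node unschedulable.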
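The node-maintenance answer above can be sketched as one function. It is an illustration under stated assumptions, not a production script: the function name node_maintenance and the 120-second timeout are arbitrary choices, and the pause for the repair is a simple prompt.

```shell
#!/bin/sh
# Take a node out of service, wait for the repair, then return it to service.
# Assumption: the 120s drain timeout is illustrative; tune it to your workloads.
node_maintenance() {  # usage: node_maintenance <node-name>
  node="$1"
  # Stop new pods from being scheduled onto the node.
  kubectl cordon "$node" || return 1
  # Evict existing pods. Evictions honour PodDisruptionBudgets, so a blocking
  # PDB makes drain retry until the timeout and then fail.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s || return 1
  printf 'Node %s is drained. Repair it, then press Enter to uncordon.\n' "$node"
  read -r _
  # Make the node schedulable again.
  kubectl uncordon "$node"
}
```

If the drain step fails, the function stops and leaves the node cordoned, which is usually what you want while you work out which PodDisruptionBudget or unmanaged pod is blocking the eviction.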

Quick check

A container has RESTARTS: 7. Which single flag on kubectl logs shows you why it last died, and why?
kubectl get endpoints my-svc prints <none>. Name the two most likely causes and the command that distinguishes them.
A pod shows Last State: Terminated, Reason: OOMKilled, Exit Code: 137. What’s the cause, and what’s the right fix (not the lazy one)?
RBAC has no deny rules. Given that, what does Error: forbidden always mean, and which command reproduces the decision for a specific ServiceAccount?
Your kubectl get pvc shows Pending. Are you looking at a binding or a mount problem, and which two things do you check?

Answers

--previous: kubectl logs <pod> --previous shows the output of the last terminated container; the current container is a fresh restart, so its log has not yet reached the crash (or is still empty).
Either the selector doesn’t match the pod labels, or the pods aren’t Ready (failing readiness). Distinguish by comparing kubectl describe svc my-svc Selector with kubectl get pods --show-labels (label mismatch) versus kubectl get pods (none Ready).
The container exceeded its memory limit (exit 137 = SIGKILL by the OOM-killer). The right fix is to set an adequate memory request + limit and address the app’s memory use (leak/heap), right-sizing from VPA data — not just raising the limit until it stops.
Because RBAC is purely additive, it means no binding ever granted that subject the verb/resource/namespace — the permission was simply never created. Reproduce with kubectl auth can-i <verb> <resource> -n <ns> --as system:serviceaccount:<ns>:<sa>.
A binding problem: the claim never bound to a PersistentVolume, so there is nothing to attach or mount yet. Check kubectl describe pvc Events and kubectl get storageclass (missing default/named class, no matching PV, or expected WaitForFirstConsumer). If the PVC were Bound, you would instead debug the mount via the pod’s Events.
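The exit code in the OOMKilled answer follows a general rule: a code above 128 means the process was killed by signal number (code − 128), so 137 is 128 + 9, SIGKILL. The sketch below decodes that arithmetic and reads the last termination record from a pod. The function names are hypothetical, the signal numbers are the Linux ones, and last_termination assumes you care about the first container in the pod.

```shell
#!/bin/sh
# Translate a container exit code into a short explanation.
# Codes above 128 mean "killed by signal number (code - 128)".
explain_exit_code() {  # usage: explain_exit_code <code>
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    printf 'killed by %s\n' "$name"
  elif [ "$code" -eq 0 ]; then
    printf 'exited cleanly\n'
  else
    printf 'application error (exit %s)\n' "$code"
  fi
}

# Print "<reason> <exitCode>" for the previous run of a pod's first container.
last_termination() {  # usage: last_termination <pod> [namespace]
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```

Exit code 137 on its own only says SIGKILL, which a container also receives when it ignores SIGTERM past its grace period. It is the Reason: OOMKilled field that identifies a memory kill, so read both values together.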

Exercise

Build your own break-and-fix runbook (timed, free, local). On a fresh kind cluster, plant one fault per layer and prove you can diagnose each from observation alone.

Create the cluster (kind create cluster --name runbook).
Plant five faults, one per layer: Pod (a wrong image tag → ImagePullBackOff), Node (cordon the node, then deploy something → Pending), Networking (a Service with a deliberately mismatched selector → no endpoints), Storage (a PVC with a non-existent storageClassName → Pending), and RBAC (a pod using a ServiceAccount with no permissions calling the API → 403).
For each, write down the layer, the diagnostic command, the exact Event/error, the root cause, and the fix — before you fix it. Time yourself: under 6 minutes per fault.
Fix each via an edited manifest re-applied (not a manual hot-patch), and verify (Running/Ready, endpoints populated, Bound, auth can-i … yes).
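One possible way to plant the five faults from the steps above is sketched here, so that your time goes into diagnosis rather than setup. All resource names (bad-image, web, web-typo, broken-pvc, no-perms, stuck) are invented for this example. The sketch assumes a single-node kind cluster and the default namespace, and it plants the node fault last, because cordoning the only node would otherwise leave every later pod Pending and hide the other symptoms. For the RBAC layer it only creates the ServiceAccount; you can check the denial with kubectl auth can-i or extend it with a pod that calls the API.

```shell
#!/bin/sh
# Print a PVC that names a StorageClass that does not exist, so it stays Pending.
broken_pvc_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: broken-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: does-not-exist
  resources:
    requests:
      storage: 1Gi
EOF
}

plant_faults() {
  # Pod layer: a tag that does not exist leads to ImagePullBackOff.
  kubectl create deployment bad-image --image=nginx:no-such-tag
  # Networking layer: healthy pods labelled app=web, but a Service whose
  # generated selector is app=web-typo, so it gets no endpoints.
  kubectl create deployment web --image=nginx
  kubectl create service clusterip web-typo --tcp=80:80
  # Storage layer: the claim can never bind.
  broken_pvc_manifest | kubectl apply -f -
  # RBAC layer: a ServiceAccount with no bindings at all.
  kubectl create serviceaccount no-perms
  # Wait until the healthy pods are running before touching the node.
  kubectl rollout status deployment/web --timeout=120s
  # Node layer, planted last: cordon every node, then deploy something new.
  kubectl get nodes -o name | while read -r node; do
    kubectl cordon "${node#node/}"
  done
  kubectl create deployment stuck --image=nginx
}
```

Run plant_faults once against the runbook cluster, then work through the faults in any order. Because the node is cordoned, remember to uncordon it as part of fixing the node fault, or the replacement pods from your other fixes will stay Pending.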

Self-assess:

Criterion	Target
Identified the correct layer before touching anything	All 5
Found root cause from `describe`/`logs`/`endpoints`/`auth can-i` (not guessing)	All 5
Fixed via edited manifest, re-applied	All 5
Verified the symptom is actually gone	All 5
Whole drill completed	Under 30 minutes

Cleanup: kind delete cluster --name runbook.

Cost note: free / local — the whole exercise runs in Docker on your laptop.

Certification mapping

CKA — troubleshooting is the single largest domain (~30%) of the exam. This lesson is core: node NotReady triage, Service/endpoint and DNS debugging, PVC/storage failures, and RBAC Forbidden reproduction with kubectl auth can-i all appear as live-cluster tasks. The cordon/drain/uncordon and “fix the broken pod” patterns are exam staples.
CKAD — the app-facing half: diagnosing CrashLoopBackOff/ImagePullBackOff/OOMKilled, probes vs readiness/endpoints, ConfigMap/Secret wiring, and using --previous and describe fast. Speed on these is the exam.
CKS — the security-adjacent failures: Pod Security Admission rejections, admission-webhook denials (Kyverno/Gatekeeper), default-deny NetworkPolicy drops (including DNS egress), and least-privilege RBAC. Knowing how a webhook with failurePolicy: Fail can wedge the cluster is exactly the operational-security thinking CKS probes.

Glossary

CrashLoopBackOff — a pod state where the container repeatedly starts, exits, and is restarted with increasing back-off delay; an application/probe problem, not scheduling.
ImagePullBackOff / ErrImagePull — the node cannot pull the image (bad tag, missing imagePullSecret, registry rate limit, wrong arch); the container never starts.
OOMKilled — the container was killed (SIGKILL, exit 137) for exceeding its memory limit or under node memory pressure.
Pending — the scheduler has not placed the pod on a node, usually due to resources, taints, affinity/topology, or an unbound PVC.
Evicted — the kubelet removed a running pod because of node pressure (disk/memory/PID); it picks victims by whether their usage exceeds their requests and by pod priority.
Node Conditions — kubelet-reported node health flags: Ready, MemoryPressure, DiskPressure, PIDPressure.
Cordon / drain / uncordon — mark a node unschedulable / evict its pods (respecting PDBs) / return it to service.
Endpoints / EndpointSlice — the list of Ready pod IPs a Service routes to, derived from its label selector; “no endpoints” means the selector matched nothing Ready.
CoreDNS — the cluster DNS server (in kube-system); failures or blocked egress to it on port 53 break in-cluster name resolution.
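NetworkPolicy — namespaced ingress/egress firewall rules for pods, enforced only by a policy-capable CNI (Calico, Cilium). Pods are open by default; once any policy selects a pod, only traffic that some policy explicitly allows gets through, and allows are additive.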
PVC / PV / StorageClass — a claim for storage / the provisioned volume / the template that dynamically provisions PVs; a Pending PVC means binding hasn’t happened.
Access modes — ReadWriteOnce (one node), ReadOnlyMany, ReadWriteMany (multi-node) — the cause of Multi-Attach errors when mismatched.
RBAC — Role-Based Access Control; Roles/ClusterRoles + (Cluster)RoleBindings, purely additive with no deny rules.
kubectl auth can-i — asks the API server whether a subject may perform an action; with --as it impersonates a user/ServiceAccount to reproduce a decision.
Admission webhook — a validating/mutating HTTP callback (or policy engine like Kyverno/Gatekeeper) that can reject or modify API requests (create, update, delete) before they are persisted; a Fail policy with a dead backend can block all matching API calls.
Ephemeral container / kubectl debug — a temporary toolbox container attached to a running pod for debugging, including distroless images with no shell.

Next steps

You can now diagnose the everyday failures across every layer. The next lesson steps up from “a pod is broken” to “the cluster is broken” — control-plane and etcd incidents, and structured root-cause analysis:

Advanced Kubernetes Troubleshooting: Control-Plane, etcd & Complex Incident RCA — apiserver outages, etcd quorum loss and restore, cascading failures, and blameless postmortems.
Kubernetes RBAC: Least-Privilege Design — the deep dive behind the RBAC/admission playbook.
Designing Zero-Trust Pod Networking: Default-Deny NetworkPolicies and Cilium L7-Aware Rules — so “NetworkPolicy dropped it” is always a known state, not a mystery.
Kubernetes Interview & Certification Prep: KCNA / CKAD / CKA / CKS Roadmap — turn this troubleshooting fluency into a certification.