Kubernetes Troubleshooting Playbooks: Pods, Nodes, Networking, Storage & RBAC

The difference between an engineer who has run Kubernetes for a year and one who is still nervous around it is rarely knowledge of obscure features. It is a method. When a pod is stuck and a manager is asking for an ETA, the strong engineer does not guess — they run a short, fixed sequence of commands that narrows the problem to one layer, form a hypothesis, prove it, fix it, and verify. Everyone else restarts things and hopes.

This lesson gives you that method and then turns it into playbooks: for each common failure you will get a table of symptom → likely cause → the diagnostic command that confirms it → the fix. We cover the failures you genuinely hit in production and on the CKA/CKS exams — pods that crash-loop, fail to pull images, get OOM-killed, sit Pending or get Evicted; nodes that go NotReady; Services with no endpoints, DNS that fails, NetworkPolicies that silently drop traffic, and Ingress controllers returning 502; storage that never binds or never mounts; and the RBAC Forbidden and admission-webhook denials that block deploys. By the end you will diagnose any of these on a free local cluster without guessing.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need a working kubectl, a cluster you can break (a free local one is ideal — kind, minikube, or k3d), and the core object model from earlier in the course: Pods, ReplicaSets, Deployments & Services and the control-plane/node architecture. It also helps to have met RBAC, NetworkPolicies, and CSI storage, though we re-explain each failure from first principles. This is the Troubleshooting lesson of the Kubernetes Zero-to-Hero course: the bridge between knowing the objects and operating them under pressure. The next lesson steps up to cluster-level incidents and root-cause analysis. Everything here runs on free, local tooling — no cloud account, no charges.

The method: a loop, not a guess

Almost every Kubernetes problem yields to the same six-step loop. The discipline is to follow it in order rather than jumping to a fix you have used before.

  1. Observe. Look at reality before forming any opinion. kubectl get for the high-level state, then kubectl describe for the detail and Events, then kubectl logs for what the app itself said. Read the cluster’s own event stream with kubectl get events --sort-by=.lastTimestamp. Resist the urge to change anything yet.
  2. Isolate the layer. Kubernetes is layered: scheduling → image/runtime → application → networking → storage → policy/auth. Decide which layer the symptom lives in. A pod that never gets a node is a scheduling problem; a pod that runs but a Service can’t reach it is a networking problem. Naming the layer removes most of the search space before you run a single deep-dive command.
  3. Compare desired vs actual. Kubernetes is declarative: every controller is trying to make actual match desired. Most bugs are a gap between the two — a Service selector that doesn’t match the pod labels, a replica count that the scheduler can’t satisfy, an image tag that doesn’t exist. Put the spec next to the live state and look for the mismatch.
  4. Form one hypothesis. State it as a falsifiable sentence: “the Service has no endpoints because its selector is app=web but the pods are labelled app=frontend.” A vague hunch (“networking is broken”) cannot be tested; a specific claim can.
  5. Fix — the smallest change that tests the hypothesis. Edit the manifest and re-apply (kubectl apply -f), not a manual hot-patch you’ll forget. If the fix proves the hypothesis, you’ve also found the root cause. If it doesn’t, you’ve cheaply eliminated one possibility — go back to step 4.
  6. Verify, then prevent. Confirm the symptom is gone, not just that “it looks better”: kubectl get pods all Running/Ready, kubectl get endpoints populated, kubectl rollout status complete. Then ask the prevention question: what guardrail (a probe, a resource request, a default-deny policy with the right allow-rule, a CI check) stops this recurring?
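
Step 6 can be sketched as a small shell helper. This is a minimal sketch, not a prescribed tool: the function name verify_fix and the KUBECTL override variable are hypothetical (KUBECTL is not a kubectl feature; it only lets you print the commands with KUBECTL=echo), and the 120s timeout is an arbitrary assumption.

```shell
# KUBECTL is a hypothetical override hook: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# verify_fix <deployment> <service> [namespace]
# Stops at the first check that fails, so a non-zero exit means "not fixed".
verify_fix() {
  deploy="$1"
  svc="$2"
  ns="${3:-default}"
  $KUBECTL rollout status "deployment/$deploy" -n "$ns" --timeout=120s &&
    $KUBECTL get pods -n "$ns" -o wide &&
    $KUBECTL get endpoints "$svc" -n "$ns"
}
```

For example, verify_fix web web would wait for the web Deployment to finish rolling out, list the pods, and show the endpoints of the web Service in the default namespace.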

[Figure: Kubernetes troubleshooting decision tree]

The decision tree above encodes step 2: start from the symptom at the top, branch on “is the pod scheduled / running / reachable?”, and each leaf points you at the playbook below. When you are under pressure, walking the tree keeps you honest about which layer you are in before you touch anything.

The four commands that answer everything

You can solve the large majority of problems with four commands. Know precisely what each one tells you — this is what stops the flailing.

Command | Question it answers | Where to look
kubectl get <resource> -o wide | What is the high-level state? Which node, which IP, how many ready? | STATUS, RESTARTS, READY, NODE columns
kubectl describe <resource> | Why is it in that state? | The Events section at the bottom, which states reasons verbatim
kubectl logs <pod> [-c <container>] [--previous] | What did the application itself say? | App stdout/stderr; --previous shows the last crashed container
kubectl get events --sort-by=.lastTimestamp | What has the cluster been doing recently, across objects? | Chronological reasons: scheduling, pulls, probe failures, evictions

A few high-leverage habits: alias kubectl to k and turn on completion; use -o wide by default so you always see the node and IP; reach for --previous the instant you see RESTARTS > 0; and remember that describe reads Events from the API, where they expire (after about 1 hour by default). kubectl get events reads the same store, so for older history you need your logging or event-export stack. For deeper inspection, kubectl get <res> -o yaml shows the full live object including .status, and kubectl debug (ephemeral containers, GA since v1.25) lets you attach a toolbox container to a running or distroless pod without rebuilding the image.
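
To make the four commands a reflex, you could wrap them in one function. The sketch below is illustrative only: triage_pod and the KUBECTL override variable are hypothetical names (KUBECTL=echo prints the commands instead of running them), and it assumes a single-container pod, since multi-container pods need -c <container> on the logs calls.

```shell
# KUBECTL is a hypothetical override hook: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# triage_pod <pod> [namespace]: run the four observation commands in order.
triage_pod() {
  pod="$1"
  ns="${2:-default}"
  $KUBECTL get pod "$pod" -n "$ns" -o wide   # high-level state, node, IP
  $KUBECTL describe pod "$pod" -n "$ns"      # reasons, in the Events section
  # --previous fails when the container never restarted, so fall back
  # to the logs of the current container.
  $KUBECTL logs "$pod" -n "$ns" --previous ||
    $KUBECTL logs "$pod" -n "$ns"
  $KUBECTL get events -n "$ns" --sort-by=.lastTimestamp
}
```

Running KUBECTL=echo triage_pod <pod> first is a cheap way to check what the function would execute before pointing it at a real cluster.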

Pods: the workload-level playbook

Most incidents are pod-level, and the STATUS column plus kubectl describe Events almost always name the cause. The mental split is scheduling vs runtime vs application: Pending is scheduling (the pod has no node yet); ImagePullBackOff is runtime (the node can’t get the image); CrashLoopBackOff and OOMKilled are application/resource (the container started and then died); Evicted is node pressure pushing a pod off after it ran.

Symptom (STATUS / signal) | Likely cause | Diagnostic command | Fix
CrashLoopBackOff | App exits non-zero on start (bad config, missing env/secret, failed migration, wrong command), or a too-aggressive liveness probe kills it before it’s ready | kubectl logs <pod> --previous; kubectl describe pod <pod> (exit code, Last State: Terminated, probe events) | Fix the app error the logs show (supply the env/Secret/ConfigMap, correct the command/args); if it’s the probe, fix the probe target/port or add a startupProbe so slow starts aren’t killed
ImagePullBackOff / ErrImagePull | Image name/tag typo, tag doesn’t exist, private registry without imagePullSecrets, registry rate-limited, or wrong architecture | kubectl describe pod <pod>: Events show the exact pull error (manifest unknown, unauthorized, 429 Too Many Requests) | Correct the image/tag; add an imagePullSecret (kubectl create secret docker-registry …) and reference it on the pod/ServiceAccount; for rate limits, authenticate or use a pull-through cache
OOMKilled (in Last State, exit code 137) | Container exceeded its memory limit (or the node ran out and the kernel OOM-killer chose it) | kubectl describe pod <pod> → Last State: Terminated, Reason: OOMKilled; kubectl top pod <pod> | Raise the memory limit (and set a matching request); fix the app’s leak/heap; right-size with VPA recommendations rather than bumping the limit blindly
Pending | Scheduler can’t place it: insufficient CPU/memory on any node, an untolerated taint, unsatisfiable node/pod affinity or topology spread, or an unbound PVC | kubectl describe pod <pod>: Events show “Insufficient cpu/memory”, “had untolerated taint”, “didn’t match node selector”, “had volume node affinity conflict” | Lower the request, add a toleration or the right nodeSelector/label, add capacity (or let Cluster Autoscaler add it), or fix the PVC (see Storage). kubectl get nodes -o wide and kubectl describe node show allocatable headroom
Pending with “0/N nodes available” and no obvious resource line | No schedulable nodes (all cordoned/NotReady), or pod requests a resource (GPU, specific zone) nothing offers | kubectl get nodes; kubectl describe pod Events | Uncordon/repair nodes (see Nodes); relax the constraint or provision matching capacity
Evicted (pod text shows “The node was low on resource: …”) | Node pressure: DiskPressure/MemoryPressure triggered the kubelet to evict lower-priority/over-request pods | kubectl get pod <pod> -o yaml → status.message; kubectl describe node <node> (Conditions) | Clear node pressure (free disk/images, fix the noisy pod); set proper requests and a sensible PriorityClass so critical pods aren’t evicted first; delete the dead Evicted objects (kubectl delete pod --field-selector status.phase=Failed)
Init:Error / Init:CrashLoopBackOff | An init container is failing (waiting on a dependency, bad migration, missing config) | kubectl logs <pod> -c <init-container>; kubectl describe pod shows which init step is stuck | Fix that init container’s command/config or the dependency it waits for; init containers run to completion in order before app containers start
ContainerCreating (stuck) | Volume not attaching/mounting, missing Secret/ConfigMap referenced as a volume, CNI not assigning an IP, or image still pulling | kubectl describe pod <pod>: Events name it (FailedMount, FailedCreatePodSandBox, secret “not found”) | Create the missing Secret/ConfigMap; resolve the volume issue (see Storage); check CNI health on the node
Completed but expected to keep running | Container’s main process exited 0 (script finished, wrong command, ran as a one-shot) | kubectl logs <pod>; check spec.containers[].command | Make it a long-running process or use a Job/CronJob if it’s genuinely one-shot
Pod Running but not Ready (0/1) | Readiness probe failing: app up but not serving, dependency down, wrong probe path/port | kubectl describe pod <pod> (Readiness probe events); kubectl logs <pod> | Fix the probe target or the dependency; a failing readiness probe (correctly) keeps the pod out of Service endpoints, which is the link to “no endpoints” below
Terminating (stuck for minutes) | A finalizer is blocking deletion, or the node is gone and the pod is orphaned | kubectl get pod <pod> -o yaml → metadata.finalizers, deletionTimestamp | Resolve/remove the finalizer’s owner; for a dead node, kubectl delete pod --grace-period=0 --force (last resort)
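
The scheduling vs runtime vs application split can be written down as a lookup. This is a sketch under assumptions: the function name layer_for_status and the layer labels it prints are made up for illustration, and it only covers the statuses discussed in this playbook.

```shell
# layer_for_status <STATUS>: print the layer to investigate first.
layer_for_status() {
  case "$1" in
    Pending)                       echo "scheduling" ;;
    ImagePullBackOff|ErrImagePull) echo "runtime" ;;
    CrashLoopBackOff|OOMKilled)    echo "application" ;;
    Evicted)                       echo "node-pressure" ;;
    Init:*)                        echo "init-container" ;;
    ContainerCreating)             echo "volume-secret-or-cni" ;;
    *)                             echo "unknown" ;;
  esac
}
```

You would feed it the STATUS column from kubectl get pods, for example layer_for_status ImagePullBackOff, and use the answer to pick the row of the table to start from.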

Two reflexes worth burning in. First, --previous for any crash: CrashLoopBackOff means the container already died and restarted, so the current container has no useful logs — kubectl logs <pod> --previous shows the failed run. Second, exit codes are a shortcut: 137 is SIGKILL (usually OOM or a liveness kill), 143 is SIGTERM (graceful shutdown), 1/2 are generic app errors, and 126/127 mean the command isn’t executable / not found.
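
The exit-code shortcut is easy to encode. The helper below is a minimal sketch: explain_exit_code is a hypothetical name, the wording of each message is an assumption, and the fallback relies on the shell convention that codes above 128 mean 128 plus the signal number.

```shell
# explain_exit_code <code>: map a container exit code to its usual meaning.
# Expects a non-negative integer, as shown by kubectl describe pod.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    1|2) echo "generic application error" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (usually OOM or a liveness kill)" ;;
    143) echo "SIGTERM (graceful shutdown)" ;;
    *)
      # Codes above 128 conventionally mean 128 + signal number.
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}
```

Pass it the Exit Code value from the Last State block of kubectl describe pod, for example explain_exit_code 137.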

Nodes: when the machine, not the pod, is the problem

When many pods on one node misbehave at once — evictions, ContainerCreating, sudden NotReady pods — suspect the node, not the workloads. A node has a small set of Conditions the kubelet reports, and reading them is the whole game.

Symptom | Likely cause | Diagnostic command | Fix
Node NotReady | kubelet stopped/crashed, lost API-server connectivity, CNI not ready, or the host is down | kubectl describe node <node> (Conditions, last heartbeat); on the host: systemctl status kubelet, journalctl -u kubelet | Restart/repair the kubelet; fix networking to the control plane; reinstall/repair the CNI; if hardware, drain and replace
MemoryPressure = True | Node out of allocatable memory; kubelet starts evicting | kubectl describe node <node>; kubectl top node | Reduce load (evict/reschedule), set proper requests, add nodes; find the offender with kubectl top pod -A --sort-by=memory
DiskPressure = True | Disk/inode exhaustion, often image bloat or container logs filling the disk | kubectl describe node <node>; on host df -h, crictl images | kubelet garbage-collects images/containers automatically; if not enough, prune images, rotate logs, grow the disk, move /var/lib/containerd to bigger storage
PIDPressure = True | Too many processes/threads on the node (a fork bomb or a leaky workload) | kubectl describe node <node> | Find and fix the offending pod; set pod PID limits; add capacity
Pods Evicted en masse on one node | The pressure conditions above triggered kubelet eviction by priority/over-request | kubectl get pods -A -o wide piped to grep Evicted; node Conditions | Resolve the pressure; set requests + PriorityClass so critical pods survive
Node Ready but SchedulingDisabled | The node was cordoned (often left over from a drain/upgrade) | kubectl get nodes (shows SchedulingDisabled) | kubectl uncordon <node> once it’s healthy
New pods avoid a node | A taint (e.g. node.kubernetes.io/disk-pressure, or a custom one) repels pods without the matching toleration | kubectl describe node <node> (Taints) | Remove the taint if stale, or add the toleration to pods that should run there
Whole node’s pods unreachable | Node-level network/CNI failure, or the node is partitioned from the cluster network | kubectl get nodes -o wide; check CNI DaemonSet pods on that node (kubectl get pods -n kube-system -o wide) | Restart/repair the CNI agent on the node; check the underlying network/security groups

A clean repair sequence for a node you need to take out of rotation safely: cordon → drain → fix → uncordon. kubectl cordon <node> stops new pods landing; kubectl drain <node> --ignore-daemonsets --delete-emptydir-data evicts the existing ones while respecting PodDisruptionBudgets (drain also cordons the node itself if you have not already done so); you fix or reboot the host; then kubectl uncordon <node> returns it to service. If the node stays schedulable while you work, new pods keep landing on a sick machine.
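
The sequence above fits in two small functions, one for each side of the repair. This is a sketch, not an operational runbook: begin_node_maintenance, end_node_maintenance, and the KUBECTL override variable are hypothetical names, and KUBECTL=echo gives a dry run that only prints the commands.

```shell
# KUBECTL is a hypothetical override hook: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# begin_node_maintenance <node>: stop new pods landing, then evict the
# existing ones. The drain only runs if the cordon succeeded.
begin_node_maintenance() {
  $KUBECTL cordon "$1" &&
    $KUBECTL drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# end_node_maintenance <node>: return a repaired node to service.
end_node_maintenance() {
  $KUBECTL uncordon "$1"
}
```

Keeping the two halves separate is deliberate: the fix or reboot happens between them, and you only call end_node_maintenance once the node reports Ready again.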

Networking: Services, DNS, NetworkPolicy & Ingress

Networking is where troubleshooting gets subtle, because there are several independent layers — Service-to-pod selection, cluster DNS, NetworkPolicy, and (north-south) Ingress — and a failure in any one looks like “it can’t connect”. Isolate the layer first.

Symptom | Likely cause | Diagnostic command | Fix
Service has no endpoints (connections refused/time out) | Service selector doesn’t match any pod labels, or matching pods aren’t Ready (failing readiness), or targetPort is wrong | kubectl get endpointslices -l kubernetes.io/service-name=<svc> (or kubectl get endpoints <svc>) shows empty; compare kubectl describe svc <svc> Selector with kubectl get pods --show-labels | Align the selector to the pod labels (or vice versa); fix readiness so pods become Ready; correct targetPort to the container’s actual port
Endpoints exist but still can’t connect | targetPort/containerPort mismatch, app listening on 127.0.0.1 not 0.0.0.0, or a NetworkPolicy dropping it | kubectl exec <client-pod> -- curl -sS <svc>:<port>; kubectl debug a netshoot pod; check the app’s bind address | Fix the port mapping; make the app listen on 0.0.0.0; check policies (below)
DNS resolution fails (Name or service not known) | CoreDNS down/overloaded, wrong dnsPolicy, or a NetworkPolicy blocking egress to kube-dns on port 53 | kubectl run -it --rm dnsutils --image=registry.k8s.io/e2e-test-images/agnhost:2.45 -- nslookup kubernetes.default; kubectl get pods -n kube-system -l k8s-app=kube-dns; kubectl logs -n kube-system -l k8s-app=kube-dns | Restore CoreDNS (scale/resources/restart); allow egress to kube-dns (UDP+TCP 53) in your NetworkPolicy; verify /etc/resolv.conf search domains
Cross-namespace name doesn’t resolve | Using a short name across namespaces | nslookup <svc>.<ns>.svc.cluster.local from a client pod | Use the FQDN <svc>.<namespace>.svc.cluster.local (short names only resolve within the same namespace)
NetworkPolicy silently drops traffic | A default-deny policy is in effect with no matching allow rule; CNI doesn’t enforce policy; or egress (incl. DNS) not allowed | kubectl get networkpolicy -A; kubectl describe networkpolicy <np>; test with/without the policy (kubectl exec … curl) | Add an explicit ingress/egress allow rule (remember DNS egress on 53); confirm your CNI enforces NetworkPolicy (flannel doesn’t; Calico/Cilium do)
Policy “applied” but has no effect | CNI plugin doesn’t support NetworkPolicy | kubectl get pods -n kube-system (which CNI?); CNI docs | Switch to/confirm a policy-enforcing CNI (Calico, Cilium); the API object is accepted even if nothing enforces it
Ingress 502 Bad Gateway / 503 Service Unavailable | Backend Service has no endpoints or wrong port, app not Ready, controller can’t reach pods, or backend protocol mismatch (HTTP vs HTTPS) | kubectl get endpointslices for the backend Service; kubectl logs -n <ns> <ingress-controller-pod>; kubectl describe ingress <ing> | Fix the backend (endpoints/port/readiness as above; a 502 is usually the Service behind it, not the Ingress); set the backend-protocol annotation if the app speaks HTTPS
Ingress 404 / wrong backend | Host/path rules don’t match, missing ingressClassName, or no default backend | kubectl describe ingress <ing> (rules); controller logs | Correct host/path rules; set ingressClassName; ensure the controller watches that class
External traffic never arrives (LoadBalancer Service <pending>) | No cloud LB provisioner (bare-metal/local), or quota/permission issue | kubectl describe svc <svc> (Events) | Install MetalLB (bare-metal) or use NodePort/port-forward locally; check cloud quotas/IAM in cloud

The single most common networking bug, by a wide margin, is “no endpoints”, and it almost always comes down to one of three things: the selector doesn’t match the labels, the pods aren’t Ready, or the port is wrong. Make kubectl get endpointslices (or kubectl get endpoints <svc>) your first move on any “can’t connect” report: an empty endpoint list tells you immediately that the problem is selection or readiness, while a populated list pushes you toward policy or ports.
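
The selector-versus-labels comparison is mechanical enough to script. The function below is a minimal sketch with a hypothetical name, selector_matches; it assumes both arguments are comma-separated key=value lists, which is how the SELECTOR column of kubectl get svc -o wide and the LABELS column of kubectl get pods --show-labels print them. It only handles equality selectors, not set-based expressions.

```shell
# selector_matches <selector> <labels>: exit 0 when every key=value pair in
# the comma-separated selector also appears in the comma-separated labels.
selector_matches() {
  selector="$1"
  labels=",$2,"            # pad with commas so whole pairs can be matched
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                        # this pair is present
      *) IFS="$old_ifs"; return 1 ;;         # first missing pair fails
    esac
  done
  IFS="$old_ifs"
  return 0
}
```

For the lab’s broken Service, selector_matches "app=frontend" "app=web,pod-template-hash=abc123" fails, while selector_matches "app=web" with the same labels succeeds (the pod-template-hash value here is a placeholder).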

Storage: PVCs, mounts & access modes

Storage failures cluster at two moments: binding (a PersistentVolumeClaim that never gets a PersistentVolume) and mounting (a bound PVC that won’t attach to the node). The split tells you where to look — kubectl get pvc for binding, the pod’s Events for mounting.

Symptom | Likely cause | Diagnostic command | Fix
PVC stuck Pending | No default StorageClass and none named; no matching PV for static provisioning; requested size/access mode unavailable | kubectl describe pvc <pvc> (Events: “no persistent volumes available”, “no storage class”); kubectl get storageclass | Set/name a StorageClass (storageClassName); mark one default (storageclass.kubernetes.io/is-default-class: "true"); for static PVs, create a PV that matches size + access mode
PVC Pending with WaitForFirstConsumer | StorageClass uses volume binding mode WaitForFirstConsumer, which binds only once a pod schedules | kubectl describe pvc <pvc> (normal “waiting for first consumer”) | This is expected; create/schedule the pod that uses it and it binds (avoids zone mismatch)
Pod stuck ContainerCreating with FailedMount | Volume can’t attach/mount: wrong zone (PV in a different AZ than the node), CSI driver issue, or fs problem | kubectl describe pod <pod> (Events: FailedAttachVolume, FailedMount, “volume node affinity conflict”) | Match pod scheduling to the volume’s zone (topology); check the CSI driver pods (kubectl get pods -n kube-system); for zone conflict use WaitForFirstConsumer
Multi-Attach error on FailedMount | A ReadWriteOnce volume is being attached to a second node (e.g. rolling update before the old pod detaches) | kubectl describe pod <pod> (“Multi-Attach error for volume”) | Use Recreate strategy for RWO volumes, or a ReadWriteMany storage class if you truly need multi-node access; wait for the old pod to fully terminate
App can’t write (read-only filesystem) | Wrong access mode, volume mounted readOnly: true, or fsGroup/permissions wrong | kubectl get pvc <pvc> (access mode); pod volumeMounts (readOnly); kubectl exec … ls -ld <path> | Use the right access mode (RWO/ROX/RWX), drop readOnly, set securityContext.fsGroup so the app’s user owns the mount
Two pods can’t share a volume | Storage class only supports RWO, not RWX | kubectl get pvc (access modes); StorageClass/provisioner capability | Use an RWX-capable backend (NFS/CephFS/Azure Files); most block storage is RWO only
PVC won’t delete / stuck Terminating | A pod still uses it, or a finalizer (kubernetes.io/pvc-protection) holds it | kubectl describe pvc <pvc> (finalizers, “still being used by pod”) | Delete the consuming pod first; the protection finalizer releases once nothing mounts it
Data gone after pod restart | Used emptyDir (ephemeral) or a Deployment instead of a StatefulSet for stateful data | Inspect volumes (emptyDir vs PVC) | Use a PVC (and a StatefulSet with volumeClaimTemplates) for data that must survive restarts/rescheduling

The reflex: kubectl get pvc first. Bound means binding succeeded and any remaining failure is at mount time (look at the pod’s Events). Pending means binding failed and you look at StorageClasses and describe pvc. That one branch saves you from debugging mounts when the volume never even bound.
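
That branch can be captured in a few lines. This is a sketch with a hypothetical function name, pvc_next_step, and illustrative message text; it takes the claim’s phase as its only argument.

```shell
# pvc_next_step <phase>: print where to look next for a given PVC phase.
pvc_next_step() {
  case "$1" in
    Bound)
      echo "binding ok: read the pod's Events for FailedMount or FailedAttachVolume" ;;
    Pending)
      echo "not bound: run kubectl describe pvc and kubectl get storageclass" ;;
    *)
      echo "unexpected phase '$1': inspect with kubectl get pvc -o yaml" ;;
  esac
}
```

One way to feed it is pvc_next_step "$(kubectl get pvc <pvc> -o jsonpath='{.status.phase}')", which reads the phase straight from the claim’s status.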

RBAC & admission: when the API server says no

These failures are different in flavour: the cluster is working; it is refusing you. The two sources are authorization (RBAC says you lack permission) and admission control (a validating/mutating webhook or a policy engine rejects the object). Crucially, RBAC is purely additive: there are no deny rules, so an RBAC Forbidden always means a permission was never granted, not that something explicitly blocked you.

Symptom | Likely cause | Diagnostic command | Fix
Error … is forbidden: User "X" cannot <verb> <resource> | No Role/ClusterRole grants that subject the verb+resource (+namespace); the permission was simply never granted | kubectl auth can-i <verb> <resource> -n <ns> --as <user> (or --as system:serviceaccount:<ns>:<sa>) returns no | Create/extend a Role/ClusterRole with the verb+resource and bind it (RoleBinding/ClusterRoleBinding) to the subject
A pod’s app gets 403 from the Kubernetes API | The pod’s ServiceAccount lacks RBAC (or you assumed the default SA has rights; it has almost none) | kubectl auth can-i <verb> <resource> --as system:serviceaccount:<ns>:<sa>; check serviceAccountName on the pod | Bind a Role to that ServiceAccount; set serviceAccountName explicitly; grant least privilege, not cluster-admin
Forbidden only in one namespace | You bound a Role (namespaced) where you needed cluster-wide, or bound in the wrong namespace | kubectl get rolebinding,clusterrolebinding -A -o wide piped to grep <subject> | Use a ClusterRole + ClusterRoleBinding for cluster-wide, or add a RoleBinding in the missing namespace
Can read but not modify (get works, create/delete forbidden) | Role grants only read verbs (get/list/watch), missing create/update/patch/delete | kubectl auth can-i --list -n <ns> --as <subject> (lists everything the subject can do) | Add the write verbs to the Role
admission webhook "…" denied the request | A validating webhook/policy engine (Kyverno, Gatekeeper/OPA) rejected the object for violating a policy | The error message names the webhook and reason; kubectl get validatingwebhookconfigurations; engine logs/policy reports | Make the manifest comply (add the required label/securityContext/etc.), or fix/exempt the policy if it’s wrong
Everything fails with Internal error … webhook … connection refused/timeout | A webhook’s backing Service/pod is down and its failurePolicy: Fail blocks all matching API calls | kubectl get pods -n <webhook-ns>; kubectl describe validatingwebhookconfiguration <name> | Restore the webhook’s pod/Service; as a break-glass, scope or temporarily remove the webhook config (it can wedge the whole cluster)
Object created but mutated unexpectedly (extra labels/sidecar) | A mutating webhook (e.g. a mesh injector, Kyverno mutate) changed it on admission | kubectl get mutatingwebhookconfigurations; compare submitted vs stored object | Expected behaviour if intended; otherwise adjust/exempt the mutating policy
Pod rejected: violates PodSecurity "restricted" | Pod Security Admission is enforcing a standard the pod doesn’t meet | kubectl get ns <ns> --show-labels (pod-security.kubernetes.io/enforce=…); error names the violation | Add the required securityContext (drop capabilities, runAsNonRoot, seccomp), or relax the namespace label if appropriate

The one command to internalise here is kubectl auth can-i. With --as it impersonates a user or ServiceAccount and asks the API server whether an action is allowed — so you can reproduce a Forbidden decision exactly, for the precise subject and namespace, without redeploying anything. kubectl auth can-i --list --as system:serviceaccount:<ns>:<sa> dumps everything that subject can do, which makes “what’s missing?” obvious.
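
Typing the full ServiceAccount subject by hand is error-prone, so a thin wrapper can help. The sketch below uses hypothetical names (sa_subject, can_sa, and the KUBECTL override variable, where KUBECTL=echo prints the command instead of running it); the only real interface it relies on is kubectl auth can-i with -n and --as.

```shell
# KUBECTL is a hypothetical override hook: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# sa_subject <namespace> <serviceaccount>: build the impersonation name.
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# can_sa <namespace> <serviceaccount> <verb> <resource>
# Asks the API server whether that ServiceAccount may perform the action
# in its own namespace.
can_sa() {
  $KUBECTL auth can-i "$3" "$4" -n "$1" --as "$(sa_subject "$1" "$2")"
}
```

For example, can_sa prod api list pods asks whether the api ServiceAccount in the prod namespace can list pods there (prod and api are placeholder names). Impersonation itself requires RBAC permission, so run it as a user who is allowed to impersonate.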

Hands-on lab: break it, diagnose it, fix it

You’ll plant several classic faults on a free local cluster, then walk each through the loop. Everything here is local — no cloud account, no charges.

1. Create a cluster (free / local):

kind create cluster --name ts-lab
kubectl config use-context kind-ts-lab
kubectl get nodes          # Expect: one node, STATUS Ready

2. Fault A — ImagePullBackOff. Deploy a bad image tag:

kubectl create deployment web --image=nginx:doesnotexist
kubectl get pods                       # STATUS: ImagePullBackOff / ErrImagePull
kubectl describe pod -l app=web | sed -n '/Events/,$p'   # Events name the pull error

Fix it by setting a real tag, then verify:

kubectl set image deployment/web nginx=nginx:1.27
kubectl rollout status deployment/web                    # successfully rolled out
kubectl get pods                                         # Running / 1/1 Ready

3. Fault B — Service with no endpoints. Create a Service whose selector is wrong:

kubectl expose deployment web --port=80 --selector=app=frontend   # mismatched on purpose
kubectl get endpoints web              # <none>  ← the tell
kubectl describe svc web | grep Selector
kubectl get pods --show-labels         # pods are app=web, not app=frontend

Fix by aligning the selector and confirm endpoints appear:

kubectl patch svc web -p '{"spec":{"selector":{"app":"web"}}}'
kubectl get endpoints web              # now lists a pod IP:80
kubectl run probe --rm -it --image=busybox:1.36 --restart=Never -- wget -qO- web   # serves HTML

4. Fault C — CrashLoopBackOff. Run a container that exits immediately:

kubectl run crasher --image=busybox:1.36 -- /bin/sh -c "echo starting; exit 1"
kubectl get pod crasher                # CrashLoopBackOff, RESTARTS climbing
kubectl logs crasher --previous        # 'starting' — the last crashed run
kubectl describe pod crasher | grep -A2 "Last State"   # Terminated, Exit Code: 1

The fix is to correct the command so the process stays up (here, run a long-lived process); the lesson is the --previous reflex.

5. Fault D — Pending (unschedulable). Request more CPU than the node has:

kubectl run hog --image=nginx:1.27 --overrides='{"spec":{"containers":[{"name":"hog","image":"nginx:1.27","resources":{"requests":{"cpu":"64"}}}]}}'
kubectl get pod hog                     # Pending
kubectl describe pod hog | grep -A3 Events   # "Insufficient cpu"

Fix by lowering the request (kubectl delete pod hog then recreate with cpu: 100m) and watch it schedule.
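
One way to recreate the pod is to generate the overrides JSON from a parameter, so the broken and fixed versions differ only in the CPU value. This is a sketch: hog_overrides is a hypothetical helper name, and the pod name, image, and JSON shape are copied from the lab command above.

```shell
# hog_overrides <cpu-request>: print the --overrides JSON for the hog pod.
hog_overrides() {
  printf '{"spec":{"containers":[{"name":"hog","image":"nginx:1.27","resources":{"requests":{"cpu":"%s"}}}]}}\n' "$1"
}
```

After kubectl delete pod hog, you would run kubectl run hog --image=nginx:1.27 --overrides="$(hog_overrides 100m)" and watch the pod leave Pending.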

6. Fault E — PVC Pending (no StorageClass match). Request an impossible class:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: stuck }
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: nonexistent
  resources: { requests: { storage: 1Gi } }
EOF
kubectl get pvc stuck                   # Pending
kubectl describe pvc stuck | grep -A3 Events    # no storage class / no volumes
kubectl get storageclass                # kind ships 'standard' (default)

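Fix by deleting the claim (kubectl delete pvc stuck) and recreating it with the real default class (storageClassName: standard, or omit the field); storageClassName cannot be changed on an existing PVC, so a plain re-apply is rejected. Note that kind’s standard class uses WaitForFirstConsumer binding, so the new PVC stays Pending with a “waiting for first consumer” event until a pod that mounts it is scheduled, and only then becomes Bound.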

Validation. A clean run shows: each fault reproduced with the exact STATUS/Event the playbook predicts; each diagnosis made from describe/logs/endpoints (not guessing); and each fix verified (Running/Ready, endpoints populated, Bound).

Cleanup (so nothing is left running):

kind delete cluster --name ts-lab

Cost note: free / local. kind runs the entire cluster in Docker on your laptop — no cloud account, no charges.

Common mistakes & troubleshooting

A meta-table — the mistakes engineers make while troubleshooting, which keep them stuck:

Mistake | Why it bites | Do this instead
Reading logs without --previous on a crash-loop | The current container is fresh; the failed run’s logs are gone | kubectl logs <pod> --previous whenever RESTARTS > 0
Ignoring the Events section of describe | Kubernetes literally tells you the reason there | Always scroll to Events first; add kubectl get events --sort-by=.lastTimestamp
Restarting/deleting pods before diagnosing | Destroys the evidence (and may “fix” it without you learning the cause) | Diagnose first; reproduce in the lab if you must
Assuming the default ServiceAccount has permissions | It has almost none; in-cluster API calls 403 | Bind a least-privilege Role to an explicit SA
Forgetting DNS egress in a default-deny NetworkPolicy | Everything “can’t resolve”; looks like DNS is broken | Always add egress to kube-dns on UDP+TCP 53
Blaming the Ingress for a 502 | The 502 almost always means the backend Service has no endpoints | Check kubectl get endpointslices for the backend first
Bumping memory limits to “fix” OOMKilled forever | Masks a leak; wastes capacity | Set request+limit, then fix the app / right-size with VPA
Debugging a mount when the PVC never bound | Wrong layer entirely | kubectl get pvc first: Pending = binding, Bound = mount
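
The DNS-egress mistake in the table is the easiest one to prevent with a standing allow rule. The function below is a sketch that prints such a manifest: the function name dns_egress_policy and the policy name allow-dns-egress are hypothetical, and it assumes cluster DNS runs in kube-system with the label k8s-app=kube-dns and that namespaces carry the automatic kubernetes.io/metadata.name label. Check both against your cluster before relying on it.

```shell
# dns_egress_policy [namespace]: print a NetworkPolicy that lets every pod
# in the namespace reach cluster DNS on UDP and TCP port 53.
dns_egress_policy() {
  ns="${1:-default}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
EOF
}
```

To apply it you could pipe the output to kubectl, for example dns_egress_policy <ns> | kubectl apply -f -. Because NetworkPolicies are additive, this sits alongside a default-deny policy and only opens DNS.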

Best practices

Security notes

Troubleshooting touches the cluster’s trust boundaries, so do it safely. kubectl exec/debug into a pod is privileged access to that workload — gate it with RBAC and audit it; on the CKS exam and in production, the ability to exec is a real escalation path. Be careful with --grace-period=0 --force deletes: they remove the API object even if the pod still runs on a partitioned node, which can cause split-brain for stateful apps — use it only when you’ve confirmed the node is truly gone. Treat admission webhooks as cluster-critical: a webhook with failurePolicy: Fail whose backend is down can wedge all matching API operations, so monitor webhook health and know the break-glass (scope or remove the webhook config). When you debug RBAC, resist the temptation to grant cluster-admin to make the error go away — reproduce with kubectl auth can-i --as and grant the minimum verb/resource. And never paste Secrets into logs or tickets while debugging (kubectl get secret -o yaml is base64, not encryption). These themes are developed in RBAC least-privilege design, Pod Security Admission, and default-deny NetworkPolicies.

Interview & exam questions

  1. A pod is in CrashLoopBackOff. Walk me through your steps. kubectl get pods to see RESTARTS; kubectl describe pod for Events and the Last State exit code; kubectl logs <pod> --previous for the failed run’s output. Decide: is it the app exiting non-zero (fix config/command/env) or a liveness probe killing it (fix the probe / add a startupProbe)? Fix the manifest, re-apply, kubectl rollout status.
  2. What’s the difference between ImagePullBackOff and CrashLoopBackOff? ImagePullBackOff is a runtime/pull problem — the node never got the image (bad tag, auth, rate limit); the container never started. CrashLoopBackOff means the image pulled and the container started then exited repeatedly — an application/config problem. describe Events distinguish them immediately.
  3. A Service has no endpoints. What’s the single most likely cause and the one command that confirms it? The selector doesn’t match any Ready pods (label mismatch or failing readiness). Confirm with kubectl get endpoints <svc> (empty), then compare kubectl describe svc Selector against kubectl get pods --show-labels.
  4. An app gets 403 Forbidden from the Kubernetes API. RBAC has no deny rules — so what does that error actually mean, and how do you reproduce it? Since RBAC is purely additive, Forbidden means no binding ever granted that subject the verb/resource/namespace. Reproduce with kubectl auth can-i <verb> <resource> -n <ns> --as system:serviceaccount:<ns>:<sa>; fix by binding a least-privilege Role to that ServiceAccount.
  5. A pod is Pending. Which one command tells you why, and where do you read it? kubectl describe pod <pod> — the Events section states the scheduling reason verbatim (“Insufficient cpu”, “untolerated taint”, “didn’t match node selector”, “volume node affinity conflict”).
  6. OOMKilled — what happened and what’s the exit code? The container exceeded its memory limit (or the node OOM-killer chose it); exit code 137 (SIGKILL), shown in Last State: Terminated, Reason: OOMKilled. Fix by setting an adequate memory request+limit and addressing the app’s memory use — not by bumping the limit forever.
  7. Your Ingress returns 502. Where do you look first? At the backend Service’s endpoints (kubectl get endpointslices), because a 502 almost always means the controller has no healthy backend — pods not Ready, wrong targetPort, or empty selector. Only after that do you suspect the Ingress rules, class, or backend protocol.
  8. DNS lookups fail inside pods. How do you debug it? Run a test pod and nslookup kubernetes.default; check CoreDNS (kubectl get pods -n kube-system -l k8s-app=kube-dns and its logs). A frequent cause is a default-deny NetworkPolicy with no DNS egress — add egress to kube-dns on UDP+TCP 53.
  9. A PVC is stuck Pending. What are the usual causes? No default/named StorageClass, no matching static PV (size/access mode), or WaitForFirstConsumer (expected until a pod schedules). kubectl describe pvc Events and kubectl get storageclass tell you which.
  10. You see Multi-Attach error on a pod’s volume during a rolling update. Why? A ReadWriteOnce volume can attach to only one node; a RollingUpdate tried to start the new pod (new node) before the old pod detached. Use the Recreate strategy for RWO volumes, or an RWX storage class if multi-node access is genuinely required.
  11. What does kubectl debug give you that kubectl exec doesn’t? Ephemeral containers — you attach a toolbox image to a running pod (including distroless images with no shell) or create a debug copy, without rebuilding or restarting the workload. Ideal for production pods that ship without debug tools.
  12. How do you safely take a node out of service to repair it, without disrupting workloads? kubectl cordon (stop new pods) → kubectl drain --ignore-daemonsets --delete-emptydir-data (evict existing, respecting PodDisruptionBudgets) → fix/reboot → kubectl uncordon. kubectl drain also marks the node unschedulable itself, so the explicit cordon matters mainly when you want to stop new pods landing on the sick node before you are ready to evict.
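The exit code in question 6 follows the common convention that a process killed by signal N is reported as 128 + N, so 137 is 128 + 9 (SIGKILL). A minimal sketch of that arithmetic follows. explain_exit_code is a hypothetical helper, not part of kubectl, and the convention is only a heuristic, since an application can also choose to exit with a code above 128 on its own.

```shell
# Hypothetical helper: interpret a container exit code using the 128 + N convention.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    printf 'exited cleanly\n'
  elif [ "$code" -gt 128 ]; then
    # 128 + N usually means "terminated by signal N".
    printf 'killed by signal %d\n' $((code - 128))
  else
    printf 'application exited with error %d\n' "$code"
  fi
}

explain_exit_code 137   # killed by signal 9
explain_exit_code 1     # application exited with error 1
```

On a live cluster the code can typically be read from the pod status, for example with kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}', and then passed to the helper.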

Quick check

  1. A container has RESTARTS: 7. Which single flag on kubectl logs shows you why it last died, and why?
  2. kubectl get endpoints my-svc prints <none>. Name the two most likely causes and the command that distinguishes them.
  3. A pod shows Last State: Terminated, Reason: OOMKilled, Exit Code: 137. What’s the cause, and what’s the right fix (not the lazy one)?
  4. RBAC has no deny rules. Given that, what does Error: forbidden always mean, and which command reproduces the decision for a specific ServiceAccount?
  5. Your kubectl get pvc shows Pending. Are you looking at a binding or a mount problem, and which two things do you check?

Answers

  1. --previous: kubectl logs <pod> --previous shows the last crashed container’s output; the current container is a fresh restart with no useful logs.
  2. Either the selector doesn’t match the pod labels, or the pods aren’t Ready (failing readiness). Distinguish by comparing kubectl describe svc my-svc Selector with kubectl get pods --show-labels (label mismatch) versus kubectl get pods (none Ready).
  3. The container exceeded its memory limit (exit 137 = SIGKILL by the OOM-killer). The right fix is to set an adequate memory request + limit and address the app’s memory use (leak/heap), right-sizing from VPA data — not just raising the limit until it stops.
  4. Because RBAC is purely additive, it means no binding ever granted that subject the verb/resource/namespace — the permission was simply never created. Reproduce with kubectl auth can-i <verb> <resource> -n <ns> --as system:serviceaccount:<ns>:<sa>.
  5. A binding problem: the claim never bound, so no volume has been attached or mounted yet. Check kubectl describe pvc Events and kubectl get storageclass (missing default/named class, no matching PV, or expected WaitForFirstConsumer). If the PVC were Bound, you would instead debug the mount via the pod’s Events.
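The --as value in answer 4 is easy to mistype, so it can help to build it from its two parts. The sketch below assumes a hypothetical helper name (sa_identity), namespace (demo), and ServiceAccount (app-sa). Only the system:serviceaccount:<ns>:<sa> format comes from the answer above.

```shell
# Hypothetical helper: build the user name that --as expects for a ServiceAccount.
sa_identity() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical namespace and ServiceAccount names.
as_user=$(sa_identity demo app-sa)
echo "$as_user"    # system:serviceaccount:demo:app-sa

# Against a live cluster (not run here; needs kubectl and a kubeconfig):
#   kubectl auth can-i list pods -n demo --as "$as_user"
```

The kubectl call prints yes or no, which is the same decision the API server makes for that ServiceAccount.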

Exercise

Build your own break-and-fix runbook (timed, free, local). On a fresh kind cluster, plant one fault per layer and prove you can diagnose each from observation alone.

  1. Create the cluster (kind create cluster --name runbook).

  2. Plant five faults, one per layer: Pod (a wrong image tag → ImagePullBackOff), Node (cordon the node, then deploy something → Pending), Networking (a Service with a deliberately mismatched selector → no endpoints), Storage (a PVC with a non-existent storageClassName → Pending), and RBAC (a pod using a ServiceAccount with no permissions calling the API → 403).

  3. For each, write down the layer, the diagnostic command, the exact Event/error, the root cause, and the fix, before you fix it. Time yourself: under 6 minutes per fault.

  4. Fix each via an edited manifest re-applied (not a manual hot-patch), and verify (Running/Ready, endpoints populated, Bound, auth can-i … yes).

  5. Self-assess:

    Criterion | Target
    Identified the correct layer before touching anything | All 5
    Found root cause from describe/logs/endpoints/auth can-i (not guessing) | All 5
    Fixed via edited manifest, re-applied | All 5
    Verified the symptom is actually gone | All 5
    Whole drill completed | Under 30 minutes

  6. Cleanup: kind delete cluster --name runbook.
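Three of the step 2 faults can be staged as manifest files, which also gives you something to edit and re-apply in step 4. The sketch below is one possible layout, not a prescribed one. The directory, file, and object names are hypothetical, and the image tag is assumed not to exist in the registry. The Node and RBAC faults are left out because they depend on commands and identities specific to your cluster.

```shell
# Stage three of the step-2 faults as manifest files (all names are hypothetical).
mkdir -p runbook-faults

# Pod layer: a tag assumed not to exist, so the image pull fails.
cat > runbook-faults/pod-bad-image.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: bad-image
spec:
  containers:
  - name: web
    image: nginx:no-such-tag-for-runbook
EOF

# Networking layer: the Service selects app=web-typo, but the pod is labelled app=web.
cat > runbook-faults/svc-mismatch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
  - name: web
    image: nginx
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web-typo
  ports:
  - port: 80
    targetPort: 80
EOF

# Storage layer: a StorageClass name that is not defined in the cluster.
cat > runbook-faults/pvc-no-class.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orphan-claim
spec:
  storageClassName: does-not-exist
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF

ls runbook-faults
# Apply once the kind cluster is up (not run here):
#   kubectl apply -f runbook-faults/
```

Keeping the faults in files means each fix is a visible change to a manifest rather than an untracked change to the live cluster.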

Cost note: free / local — the whole exercise runs in Docker on your laptop.

Certification mapping

Glossary

Next steps

You can now diagnose the everyday failures across every layer. The next lesson steps up from “a pod is broken” to “the cluster is broken”: control-plane and etcd incidents, and structured root-cause analysis.
