Lesson 21 of 28
Module 6 · Concepts — Reading pod failures like a native speaker
The three failures you must recognise
99% of pod failures you'll see in anger are one of these three:
CrashLoopBackOff
Your container's process exited (usually with a non-zero status), Kubernetes restarted it, it exited again, Kubernetes waited longer before retrying, and now it's looping with exponential backoff. The container started; it just didn't stay up.
Common causes:
- Application error at startup (missing config, can't reach a dependency, panic/exception)
- Wrong command / args (`command: ["./app", "--config", "/nope.yaml"]` pointing at a file that doesn't exist)
- Failed liveness probe (the kubelet kills the container even though the app itself may be fine — a misconfigured probe looks exactly like a crash)
Where to look: `kubectl logs <pod>` for the current attempt. `kubectl logs <pod> --previous` for the terminated attempt before it — essential when the container has just restarted and the fresh logs don't yet show the error.
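A CrashLoopBackOff is easy to reproduce on a throwaway cluster, which makes the `--previous` behaviour concrete. A minimal sketch — the pod name `crashy` is hypothetical, and it assumes `kubectl` is pointed at a cluster you can experiment on:

```shell
# Start a pod whose process prints a message and immediately exits non-zero.
kubectl run crashy --image=busybox --restart=Always \
  -- sh -c 'echo "cannot find /nope.yaml"; exit 1'

kubectl get pod crashy -w        # watch RESTARTS climb and CrashLoopBackOff appear
kubectl logs crashy --previous   # output of the attempt before the current one

kubectl delete pod crashy        # clean up
```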
ImagePullBackOff (and its sibling ErrImagePull)
The kubelet couldn't pull your container image. The container never started.
Common causes:
- Typo in the image reference (`ghcr.io/me/ap:latest` instead of `ghcr.io/me/app:latest`)
- Image doesn't exist at that tag (you built and pushed `v2`, but the manifest still says `v1` — or vice versa)
- Registry requires credentials you haven't supplied (a private GHCR repo without an `imagePullSecrets` entry)
- Rate-limiting by Docker Hub (common in CI; your best fix is to authenticate or mirror to a different registry)
Where to look: `kubectl describe pod <pod>` — the Events section at the bottom will quote the kubelet's exact error.
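Since the answer is always at the bottom of `describe`, it helps to jump straight there. A small sketch against a canned, hypothetical `describe` dump — in real use, pipe `kubectl describe pod <pod>` in instead of the heredoc:

```shell
# Print everything from the Events: header to the end of the output.
sed -n '/^Events:/,$p' <<'EOF'
Name:    web-7f9d
Status:  Pending
Events:
  Type     Reason  Age  From     Message
  ----     ------  ---  ----     -------
  Warning  Failed  10s  kubelet  Failed to pull image "ghcr.io/me/ap:latest": manifest unknown
EOF
```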
OOMKilled
The container exceeded its memory limit and the Linux kernel killed it. You'll see `OOMKilled` as the termination reason. On subsequent restarts you'll often also see `CrashLoopBackOff`, because the container keeps running out of memory.
Common causes:
- `resources.limits.memory` is too low for real usage
- A memory leak in the application
- A JVM / Node / Go process that ignores the container limit and grows past it
Where to look: `kubectl describe pod <pod>` — look for `Reason: OOMKilled` under `Last State`. Then check `kubectl top pod` for current memory use, and `kubectl get events -n <ns> --field-selector involvedObject.name=<pod>` for the exact moment it was killed.
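The `Last State` block is short and easy to cut out of a long `describe` dump. A sketch against canned, hypothetical output — in real use, pipe `kubectl describe pod <pod>` in instead of the heredoc:

```shell
# Print just the Last State block. Exit code 137 = 128 + 9, i.e. the process
# was killed by SIGKILL — which is what the kernel OOM killer sends.
awk '/Last State:/,/Exit Code:/' <<'EOF'
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Ready:          False
EOF
```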
The debugging loop
Regardless of which failure you're looking at, the loop is the same:
- `kubectl get pods -n <ns>` — what state is the pod in? How many restarts?
- `kubectl describe pod <pod> -n <ns>` — what do the Events say? What's the Last State?
- `kubectl logs <pod> -n <ns>` (and `--previous` for the prior attempt)
- If still stuck: `kubectl exec -it <pod> -- /bin/sh` to shell inside (won't work if the container never started or keeps dying; use an ephemeral debug container instead: `kubectl debug -n <ns> <pod> -it --image=busybox`)
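The whole loop fits in a small helper. This is a sketch, not a definitive tool — `triage` is a hypothetical name, and it assumes `kubectl` is already configured against your cluster:

```shell
# Run the standard triage sequence for one pod.
triage() {
  pod=$1
  ns=${2:-default}
  kubectl get pod "$pod" -n "$ns"                    # step 1: state and restart count
  kubectl describe pod "$pod" -n "$ns" | tail -n 15  # step 2: Events live at the bottom
  kubectl logs "$pod" -n "$ns" --previous 2>/dev/null \
    || kubectl logs "$pod" -n "$ns"                  # step 3: prior attempt, else current
}

# Usage: triage web-7f9d production
```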
Internalise this loop. It's the single most valuable reflex for running services in K8s.
Two more you'll encounter
- `Pending` — the scheduler can't place the Pod on any node. Almost always resources (not enough free CPU or memory on any node) or a node selector / taint mismatch. `kubectl describe pod` tells you exactly which.
- `CreateContainerConfigError` — the kubelet couldn't start the container because the config it references is broken: a Secret or ConfigMap doesn't exist, a volume can't mount, an environment variable source is missing. `kubectl describe pod` again.
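For `Pending`, the Events line to find is `FailedScheduling` — the scheduler spells out exactly which resource or constraint failed. A sketch against a canned, hypothetical event listing; in real use, grep the output of `kubectl describe pod <pod>`:

```shell
# Keep only the scheduler's explanation of why the pod can't be placed.
grep FailedScheduling <<'EOF'
  Normal   Pulled            kubelet            Successfully pulled image
  Warning  FailedScheduling  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
EOF
```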
Why describe before logs
New engineers reach for logs first. It's often the wrong move:
- For `ImagePullBackOff`, `Pending`, and `CreateContainerConfigError`, there are no logs yet — the container never started.
- For `OOMKilled`, the process is killed with SIGKILL, so log lines still sitting in a buffer may never be flushed.

`describe` always has the Event stream, and Events are the kubelet's and scheduler's shared diary of "what I tried and what went wrong."
Rule: `describe` first for why it's not running; `logs` for what it said while it was running.
Relevant links
- Debug pods — K8s docs
- `kubectl debug` — ephemeral debug containers, priceless when your image is distroless and has no shell.
- The Kubernetes Failure Stories repo — real-world postmortems. Read a few; pattern-match the failures.