Lesson 21 of 28
Module 6 · Concepts — Reading pod failures like a native speaker
The three failures you must recognise
99% of pod failures you'll see in anger are one of these three:
CrashLoopBackOff
Your container's process exited (usually with a non-zero status), Kubernetes restarted it, it exited again, Kubernetes waited longer before retrying, and now it's looping with exponential backoff. The container started; it just didn't stay up.
Common causes:
- Application error at startup (missing config, can't reach a dependency, panic/exception)
- Wrong command / args (`command: ["./app", "--config", "/nope.yaml"]` pointing at a file that doesn't exist)
- Failed liveness probe (the kubelet kills the container even though the app itself may be fine — a misconfigured probe looks exactly like a crash)
Where to look: `kubectl logs <pod>` for the current attempt. `kubectl logs <pod> --previous` for the terminated attempt before it — essential when the container has just restarted and the fresh logs don't yet show the error.
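A CrashLoopBackOff is easy to reproduce on a throwaway cluster, which makes the `--previous` behaviour concrete. A minimal sketch — the pod name `crashy` is hypothetical, and it assumes `kubectl` is pointed at a cluster you can experiment on:

```shell
# Start a pod whose process prints a message and immediately exits non-zero.
kubectl run crashy --image=busybox --restart=Always \
  -- sh -c 'echo "cannot find /nope.yaml"; exit 1'

kubectl get pod crashy -w        # watch RESTARTS climb and CrashLoopBackOff appear
kubectl logs crashy --previous   # output of the attempt before the current one

kubectl delete pod crashy        # clean up
```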
ImagePullBackOff (and its sibling ErrImagePull)
The kubelet couldn't pull your container image. The container never started.
Common causes:
- Typo in the image reference (`ghcr.io/me/ap:latest` instead of `ghcr.io/me/app:latest`)
- Image doesn't exist at that tag (you built and pushed `v2`, but the manifest still says `v1` — or vice versa)
- Registry requires credentials you haven't supplied (a private GHCR repo without an `imagePullSecrets` entry)
- Rate-limiting by Docker Hub (common in CI; your best fix is to authenticate or mirror to a different registry)
Where to look: `kubectl describe pod <pod>` — the Events section at the bottom will quote the kubelet's exact error.
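Since the answer is always at the bottom of `describe`, it helps to jump straight there. A small sketch against a canned, hypothetical `describe` dump — in real use, pipe `kubectl describe pod <pod>` in instead of the heredoc:

```shell
# Print everything from the Events: header to the end of the output.
sed -n '/^Events:/,$p' <<'EOF'
Name:    web-7f9d
Status:  Pending
Events:
  Type     Reason  Age  From     Message
  ----     ------  ---  ----     -------
  Warning  Failed  10s  kubelet  Failed to pull image "ghcr.io/me/ap:latest": manifest unknown
EOF
```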
OOMKilled
The container exceeded its memory limit and the Linux kernel killed it. You'll see `OOMKilled` as the termination reason. On subsequent restarts you'll often also see `CrashLoopBackOff`, because the container keeps running out of memory.
Common causes:
- `resources.limits.memory` is too low for real usage
- A memory leak in the application
- A JVM / Node / Go process that ignores the container limit and grows past it
Where to look: `kubectl describe pod <pod>` — look for `Reason: OOMKilled` under `Last State`. Then check `kubectl top pod` for current memory use, and `kubectl get events -n <ns> --field-selector involvedObject.name=<pod>` for the exact moment it was killed.
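The `Last State` block is short and easy to cut out of a long `describe` dump. A sketch against canned, hypothetical output — in real use, pipe `kubectl describe pod <pod>` in instead of the heredoc:

```shell
# Print just the Last State block. Exit code 137 = 128 + 9, i.e. the process
# was killed by SIGKILL — which is what the kernel OOM killer sends.
awk '/Last State:/,/Exit Code:/' <<'EOF'
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Ready:          False
EOF
```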
The debugging loop
Regardless of which failure you're looking at, the loop is the same:
- `kubectl get pods -n <ns>` — what state is the pod in? How many restarts?
- `kubectl describe pod <pod> -n <ns>` — what do the Events say? What's the Last State?
- `kubectl logs <pod> -n <ns>` (and `--previous` for the prior attempt)
- If still stuck: `kubectl exec -it <pod> -- /bin/sh` to shell inside (won't work if the container never started or keeps dying; use an ephemeral debug container instead: `kubectl debug -n <ns> <pod> -it --image=busybox`)
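The whole loop fits in a small helper. This is a sketch, not a definitive tool — `triage` is a hypothetical name, and it assumes `kubectl` is already configured against your cluster:

```shell
# Run the standard triage sequence for one pod.
triage() {
  pod=$1
  ns=${2:-default}
  kubectl get pod "$pod" -n "$ns"                    # step 1: state and restart count
  kubectl describe pod "$pod" -n "$ns" | tail -n 15  # step 2: Events live at the bottom
  kubectl logs "$pod" -n "$ns" --previous 2>/dev/null \
    || kubectl logs "$pod" -n "$ns"                  # step 3: prior attempt, else current
}

# Usage: triage web-7f9d production
```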
Internalise this loop. It's the single most valuable reflex for running services in K8s.
Two more you'll encounter
- `Pending` — the scheduler can't place the Pod on any node. Almost always resources (not enough free CPU or memory on any node) or a node selector / taint mismatch. `kubectl describe pod` tells you exactly which.
- `CreateContainerConfigError` — the kubelet couldn't start the container because the config it references is broken: a Secret or ConfigMap doesn't exist, a volume can't mount, an environment variable source is missing. `kubectl describe pod` again.
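For `Pending`, the Events line to find is `FailedScheduling` — the scheduler spells out exactly which resource or constraint failed. A sketch against a canned, hypothetical event listing; in real use, grep the output of `kubectl describe pod <pod>`:

```shell
# Keep only the scheduler's explanation of why the pod can't be placed.
grep FailedScheduling <<'EOF'
  Normal   Pulled            kubelet            Successfully pulled image
  Warning  FailedScheduling  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
EOF
```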
Why describe before logs
New engineers reach for logs first. It's often the wrong move:
- For `ImagePullBackOff`, `Pending`, and `CreateContainerConfigError`, there are no logs yet — the container never started.
- For `OOMKilled`, the process is killed with SIGKILL, so log lines still sitting in a buffer may never be flushed.

`describe` always has the Event stream, and Events are the kubelet's and scheduler's shared diary of "what I tried and what went wrong."
Rule: `describe` first for why it's not running; `logs` for what it said while it was running.
Relevant links
- Debug pods — K8s docs
- `kubectl debug` — ephemeral debug containers, priceless when your image is distroless and has no shell.
- The Kubernetes Failure Stories repo — real-world postmortems. Read a few; pattern-match the failures.