Lesson 22 of 28
Module 6 · Task — Reproduce and fix the three canonical pod failures (via Claude)
The task
Drive Claude to deliberately produce CrashLoopBackOff, ImagePullBackOff, and OOMKilled — one pod per failure. For each, diagnose using the debugging loop (get pods → describe → logs / logs --previous), then fix. The practice run builds the reflex; in production this is the only skill that matters when your pager fires at 2am.
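In command form, the debugging loop might look like this (a sketch; `app=myapp` is a placeholder label, swap in the selector for whichever deployment you're debugging):

```shell
# 1. What state is the pod in?
kubectl -n demo get pods -l app=myapp

# 2. Why? Events and container state live in describe.
kubectl -n demo describe pod -l app=myapp

# 3. What did the app say? Live logs first...
kubectl -n demo logs -l app=myapp

# 4. ...then the previous container's logs if it restarted.
kubectl -n demo logs -l app=myapp --previous
```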
Important note on this module's shape: the whole module is a structured "break it on purpose" exercise. The Break it on purpose section you've met in every previous task lesson is baked into the task itself here — that's why this module gets a fourth, adversarial scenario instead.
Acceptance test: for each of the three failure scenarios, you can (a) state what kubectl describe will show before you look, (b) produce the fix, (c) confirm the pod reaches Running state. No peeking at the fix until you've tried.
Setup
- A kind cluster running; `kubectl config current-context` is `kind-devops-ready` or similar.
- The `demo` namespace exists (`kubectl create namespace demo` if not).
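A quick sanity check covering both prerequisites:

```shell
kubectl config current-context                       # expect kind-devops-ready or similar
kubectl get namespace demo || kubectl create namespace demo
```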
Work each scenario in order. Drive Claude through the break + diagnose loop; write the fix yourself before you let Claude apply it, so you build the muscle rather than borrow it.
Scenario 1 — CrashLoopBackOff
Break it
Send Claude:
"Apply a Deployment named `crash` in the `demo` namespace with one replica of `busybox:1.36` whose command is `sh -c 'echo starting; sleep 2; echo bailing; exit 1'`. Don't fix anything yet. Confirm the pod enters CrashLoopBackOff."
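For reference, a manifest Claude might produce for that prompt (a sketch; the field values come from the prompt, and the `app: crash` label is an assumption that matches the selectors used in the diagnose step):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crash
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crash
  template:
    metadata:
      labels:
        app: crash
    spec:
      containers:
        - name: crash
          image: busybox:1.36
          command: ["sh", "-c"]
          args: ["echo starting; sleep 2; echo bailing; exit 1"]
```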
Diagnose
Send Claude:
"Run the standard debugging loop: `get pods`, `describe pod -l app=crash`, `logs -l app=crash`, `logs -l app=crash --previous`. Show me each output. Don't propose a fix yet."
Read each output yourself. Notice: the State section of `describe` shows `Last State: Terminated`, `Reason: Error`, `Exit Code: 1`. The `--previous` logs are what reliably tell you why the container died — the live logs may be empty or show only the newest attempt because the container just restarted.
Fix — write it yourself first
Before you let Claude patch, write down (one line): what command would I run to make this pod stop crashing? Now send Claude your fix as a prompt:
"Patch the `crash` Deployment to replace its args with `['echo running; sleep infinity']` and watch pods until one is Running."
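One way to express that fix as a command (a sketch; it assumes the container's `command` is `["sh", "-c"]` so the args entry alone carries the script):

```shell
kubectl -n demo patch deployment crash --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["echo running; sleep infinity"]}]'
kubectl -n demo get pods -l app=crash -w   # watch until Running
```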
Confirm. Clean up: `kubectl delete -f ...` (or `kubectl delete deployment crash -n demo`).
Scenario 2 — ImagePullBackOff
Break it
Send Claude:
"Apply a Deployment `badimage` in `demo` with image `ghcr.io/stefanprodan/podinfo:does-not-exist` and `containerPort: 9898`. Confirm it lands in `ImagePullBackOff`."
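A manifest matching that prompt might look like this (a sketch; the `app: badimage` label and container name are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: badimage
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: badimage
  template:
    metadata:
      labels:
        app: badimage
    spec:
      containers:
        - name: podinfo
          image: ghcr.io/stefanprodan/podinfo:does-not-exist
          ports:
            - containerPort: 9898
```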
Diagnose
Send:
"Run `kubectl -n demo describe pod -l app=badimage | tail -20` and show me the Events."
Read the Events. You'll see `Failed to pull image ... manifest unknown` and `ErrImagePull`. `kubectl logs` returns nothing useful because the container never started.
Fix
Before Claude patches, write your fix down. Then prompt:
"Set the image on the `badimage` Deployment to `ghcr.io/stefanprodan/podinfo:6.6.2` and confirm the pod reaches Running."
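As a command, the fix could be (a sketch; `podinfo` here is an assumed container name — use whatever name the manifest Claude wrote actually gives the container):

```shell
kubectl -n demo set image deployment/badimage podinfo=ghcr.io/stefanprodan/podinfo:6.6.2
kubectl -n demo rollout status deployment/badimage
```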
Scenario 3 — OOMKilled
Break it
Send Claude:
"Apply a Deployment `oom` in `demo` with image `polinux/stress`, `resources.limits.memory: 32Mi`, command `stress` with args `[--vm, 1, --vm-bytes, 128M, --vm-hang, 1]`. Observe OOMKilled."
Diagnose
Send:
"Describe the pod, filter for Last State and exit code."
You'll see `Reason: OOMKilled`, `Exit Code: 137`. 137 = 128 + SIGKILL(9) — the signature the Linux OOM killer leaves on every OOM. Memorise it.
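You can verify the arithmetic in any shell — a process killed by signal N exits with code 128 + N:

```shell
kill -l 9            # prints the signal name: KILL
echo $((128 + 9))    # prints 137
```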
Fix
Your fix: raise the memory limit to match the actual workload. Prompt:
"Patch the `oom` Deployment memory limit from 32Mi to 256Mi and confirm Running."
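As a command, that patch could be (a sketch; `kubectl set resources` rewrites the named resource field in place):

```shell
kubectl -n demo set resources deployment/oom --limits=memory=256Mi
kubectl -n demo get pods -l app=oom -w   # watch until Running
```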
Scenario 4 (adversarial) — debug a pod that's actually healthy
Every previous task lesson had a Break it on purpose section. Module 6 is already all about breaking things — so the adversarial probe here is the opposite: run the debugging loop on a pod that is perfectly fine and notice what the diagnostic commands look like when there's no bug.
Deploy podinfo again (module 1's manifest will do). Wait for it to be Ready.
Then:
- Run `kubectl -n demo describe pod -l app=podinfo`. Read the output end to end. What do the Events look like when a pod is healthy? What does `State` say? What does `Conditions` list?
- Run `kubectl -n demo logs -l app=podinfo` and `kubectl -n demo logs -l app=podinfo --previous`. Note how `--previous` errors or returns empty when there is no previous container instance.
- Predict-then-observe: if someone handed you this `describe` output in a Slack thread, how would you tell, from the output alone, that nothing is wrong?
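A couple of one-liners that make the healthy baseline explicit (a sketch; these jsonpath expressions read standard pod status fields):

```shell
# Ready condition should be True on a healthy pod
kubectl -n demo get pod -l app=podinfo \
  -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'

# Restart count should be 0
kubectl -n demo get pod -l app=podinfo \
  -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
```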
Knowing the shape of a healthy pod is how you recognise the shape of a broken one. Intern-level debugging is running the loop. Senior-level debugging is knowing the baseline so the deviation jumps out.
The self-assessment — the actual acceptance test
For each of scenarios 1–3, write down, without re-reading the task:
- CrashLoopBackOff — first `kubectl` command? Most likely root cause you'd expect?
- ImagePullBackOff — first command? Which section of the output tells you why?
- OOMKilled — first command? What's the exit code signature?
If you can answer all three without looking, you've got the muscle memory. If you can't, rerun the scenarios.
What to keep for the next lesson
Keep the three broken manifests (crash.yaml, badimage.yaml, oom.yaml — Claude wrote them), the describe + logs outputs from each scenario, and your healthy-baseline notes from scenario 4. In the next lesson you'll codify .claude/skills/k8s-pod-debug-triage/ — a skill that, given a pod name, runs the debugging loop and names the failure mode. The skill's boundary statement will explicitly say what it won't diagnose (network policies, admission webhook rejections, scheduling issues under resource pressure) — those are the cases where a fresh session should hand back to a human.