Lesson 22 of 28
Module 6 · Task — Reproduce and fix the three canonical pod failures (via Claude)
The task
Drive Claude to deliberately produce CrashLoopBackOff, ImagePullBackOff, and OOMKilled — one pod per failure. For each, diagnose using the debugging loop (get pods → describe → logs / logs --previous), then fix. The practice run builds the reflex; in production this is the only skill that matters when your pager fires at 2am.
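In command form, the debugging loop might look like this (a sketch; `app=myapp` is a placeholder label, swap in the selector for whichever deployment you're debugging):

```shell
# 1. What state is the pod in?
kubectl -n demo get pods -l app=myapp

# 2. Why? Events and container state live in describe.
kubectl -n demo describe pod -l app=myapp

# 3. What did the app say? Live logs first...
kubectl -n demo logs -l app=myapp

# 4. ...then the previous container's logs if it restarted.
kubectl -n demo logs -l app=myapp --previous
```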
Important note on this module's shape: the whole module is a structured "break it on purpose" exercise. The Break it on purpose section you've met in every previous task lesson is baked into the task itself here — that's why this module gets a fourth, adversarial scenario instead.
Acceptance test: for each of the three failure scenarios, you can (a) state what kubectl describe will show before you look, (b) produce the fix, (c) confirm the pod reaches Running state. No peeking at the fix until you've tried.
Setup
- A kind cluster running; `kubectl config current-context` is `kind-devops-ready` or similar.
- The `demo` namespace exists (`kubectl create namespace demo` if not).
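A quick sanity check covering both prerequisites:

```shell
kubectl config current-context                       # expect kind-devops-ready or similar
kubectl get namespace demo || kubectl create namespace demo
```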
Work each scenario in order. Drive Claude through the break + diagnose loop; write the fix yourself before you let Claude apply it, so you build the muscle rather than borrow it.
Scenario 1 — CrashLoopBackOff
Break it
Send Claude:
"Apply a Deployment named `crash` in the `demo` namespace with one replica of `busybox:1.36` whose command is `sh -c 'echo starting; sleep 2; echo bailing; exit 1'`. Don't fix anything yet. Confirm the pod enters CrashLoopBackOff."
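For reference, a manifest Claude might produce for that prompt (a sketch; the field values come from the prompt, and the `app: crash` label is an assumption that matches the selectors used in the diagnose step):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crash
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crash
  template:
    metadata:
      labels:
        app: crash
    spec:
      containers:
        - name: crash
          image: busybox:1.36
          command: ["sh", "-c"]
          args: ["echo starting; sleep 2; echo bailing; exit 1"]
```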
Diagnose
Send Claude:
"Run the standard debugging loop: `get pods`, `describe pod -l app=crash`, `logs -l app=crash`, `logs -l app=crash --previous`. Show me each output. Don't propose a fix yet."
Read each output yourself. Notice: the State section of `describe` shows `Last State: Terminated`, `Reason: Error`, `Exit Code: 1`. The `--previous` logs are what reliably tell you why the container died — the live logs may be empty or show only the newest attempt because the container just restarted.
Fix — write it yourself first
Before you let Claude patch, write down (one line): what command would I run to make this pod stop crashing? Now send Claude your fix as a prompt:
"Patch the `crash` Deployment to replace its args with `['echo running; sleep infinity']` and watch pods until one is Running."
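One way to express that fix as a command (a sketch; it assumes the container's `command` is `["sh", "-c"]` so the args entry alone carries the script):

```shell
kubectl -n demo patch deployment crash --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["echo running; sleep infinity"]}]'
kubectl -n demo get pods -l app=crash -w   # watch until Running
```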
Confirm. Clean up: `kubectl delete -f ...` (or `kubectl delete deployment crash -n demo`).
Scenario 2 — ImagePullBackOff
Break it
Send Claude:
"Apply a Deployment `badimage` in `demo` with image `ghcr.io/stefanprodan/podinfo:does-not-exist` and `containerPort: 9898`. Confirm it lands in `ImagePullBackOff`."
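A manifest matching that prompt might look like this (a sketch; the `app: badimage` label and container name are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: badimage
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: badimage
  template:
    metadata:
      labels:
        app: badimage
    spec:
      containers:
        - name: podinfo
          image: ghcr.io/stefanprodan/podinfo:does-not-exist
          ports:
            - containerPort: 9898
```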
Diagnose
Send:
"Run `kubectl -n demo describe pod -l app=badimage | tail -20` and show me the Events."
Read the Events. You'll see `Failed to pull image ... manifest unknown` and `ErrImagePull`. `kubectl logs` returns nothing useful because the container never started.
Fix
Before Claude patches, write your fix down. Then prompt:
"Set the image on the `badimage` Deployment to `ghcr.io/stefanprodan/podinfo:6.6.2` and confirm the pod reaches Running."
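As a command, the fix could be (a sketch; `podinfo` here is an assumed container name — use whatever name the manifest Claude wrote actually gives the container):

```shell
kubectl -n demo set image deployment/badimage podinfo=ghcr.io/stefanprodan/podinfo:6.6.2
kubectl -n demo rollout status deployment/badimage
```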
Scenario 3 — OOMKilled
Break it
Send Claude:
"Apply a Deployment `oom` in `demo` with image `polinux/stress`, `resources.limits.memory: 32Mi`, command `stress` with args `[--vm, 1, --vm-bytes, 128M, --vm-hang, 1]`. Observe OOMKilled."
Diagnose
Send:
"Describe the pod, filter for Last State and exit code."
You'll see `Reason: OOMKilled`, `Exit Code: 137`. 137 = 128 + SIGKILL(9) — the signature the Linux OOM killer leaves on every OOM. Memorise it.
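You can verify the arithmetic in any shell — a process killed by signal N exits with code 128 + N:

```shell
kill -l 9            # prints the signal name: KILL
echo $((128 + 9))    # prints 137
```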
Fix
Your fix: raise the memory limit to match the actual workload. Prompt:
"Patch the `oom` Deployment memory limit from 32Mi to 256Mi and confirm Running."
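As a command, that patch could be (a sketch; `kubectl set resources` rewrites the named resource field in place):

```shell
kubectl -n demo set resources deployment/oom --limits=memory=256Mi
kubectl -n demo get pods -l app=oom -w   # watch until Running
```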
Scenario 4 (adversarial) — debug a pod that's actually healthy
Every previous task lesson had a Break it on purpose section. Module 6 is already all about breaking things — so the adversarial probe here is the opposite: run the debugging loop on a pod that is perfectly fine and notice what the diagnostic commands look like when there's no bug.
Deploy podinfo again (module 1's manifest will do). Wait for it to be Ready.
Then:
- Run `kubectl -n demo describe pod -l app=podinfo`. Read the output end to end. What do the Events look like when a pod is healthy? What does `State` say? What does `Conditions` list?
- Run `kubectl -n demo logs -l app=podinfo` and `kubectl -n demo logs -l app=podinfo --previous`. Note how `--previous` errors or returns empty when there is no previous container instance.
- Predict-then-observe: if someone handed you this `describe` output in a Slack thread, how would you tell, from the output alone, that nothing is wrong?
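A couple of one-liners that make the healthy baseline explicit (a sketch; these jsonpath expressions read standard pod status fields):

```shell
# Ready condition should be True on a healthy pod
kubectl -n demo get pod -l app=podinfo \
  -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'

# Restart count should be 0
kubectl -n demo get pod -l app=podinfo \
  -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
```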
Knowing the shape of a healthy pod is how you recognise the shape of a broken one. Intern-level debugging is running the loop. Senior-level debugging is knowing the baseline so the deviation jumps out.
The self-assessment — the actual acceptance test
For each of scenarios 1–3, write down, without re-reading the task:
- CrashLoopBackOff — first `kubectl` command? Most likely root cause you'd expect?
- ImagePullBackOff — first command? Which section of the output tells you why?
- OOMKilled — first command? What's the exit code signature?
If you can answer all three without looking, you've got the muscle memory. If you can't, rerun the scenarios.
What to keep for the next lesson
Keep the three broken manifests (crash.yaml, badimage.yaml, oom.yaml — Claude wrote them), the describe + logs outputs from each scenario, and your healthy-baseline notes from scenario 4. In the next lesson you'll codify .claude/skills/k8s-pod-debug-triage/ — a skill that, given a pod name, runs the debugging loop and names the failure mode. The skill's boundary statement will explicitly say what it won't diagnose (network policies, admission webhook rejections, scheduling issues under resource pressure) — those are the cases where a fresh session should hand back to a human.