Lesson 23 of 28

Module 6 · Skill — Codify `k8s-pod-debug-triage`

Why this becomes a skill

You ran the debugging loop against three canonical failures and (crucially) against a healthy pod. You now know what each failure class looks like in describe + logs + logs --previous and — just as important — what a clean baseline looks like. A skill that automates the loop and names the failure class is exactly what you want at 2am when your pager goes off and coherent thought is harder than it should be.

This skill is the most dangerous one in the course to get wrong. A triage tool that confidently misdiagnoses is worse than no tool at all — it sends you down the wrong debugging path while the real incident burns. The boundary statement matters more here than anywhere else in the course.

Codify

Send Claude:

"Codify the session we just had as a skill at .claude/skills/k8s-pod-debug-triage/SKILL.md. Scope: given a namespace + a selector (or pod name), run the debugging loop (get pods, describe, logs, logs --previous) and classify the failure into one of: CrashLoopBackOff, ImagePullBackOff, OOMKilled (exit code 137), CreateContainerConfigError, Healthy (no failure detected), or Unknown — human required. Print the relevant evidence for each classification (the Last State block, the Events section, the log tail, etc.) so the user can verify the classification. Do NOT auto-fix — classification + evidence only. Keep SKILL.md under 160 lines."

Read SKILL.md. Confirm the classification rules are explicit: "if state.waiting.reason == ImagePullBackOff then ImagePullBackOff", not a vague "look for pull errors". Triage rules that rely on fuzzy reasoning will misfire in ways that are hard to predict.
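
To make "explicit" concrete, here is a minimal sketch of exact-match rules over one entry of a pod's status.containerStatuses. The function name, the rule table, and the choice to fold ErrImagePull into ImagePullBackOff are assumptions for illustration, not what Claude will necessarily generate. The sketch only covers containers currently in Waiting; Healthy detection is out of its scope.

```python
# Label text taken from the lesson; \u2014 is the dash used in that label.
UNKNOWN = "Unknown \u2014 human required"

# Exact-match rules: state.waiting.reason -> classification (hypothetical table).
WAITING_RULES = {
    "ImagePullBackOff": "ImagePullBackOff",
    "ErrImagePull": "ImagePullBackOff",
    "CreateContainerConfigError": "CreateContainerConfigError",
    "CrashLoopBackOff": "CrashLoopBackOff",
}


def classify_waiting(container_status: dict) -> str:
    """Classify one containerStatuses entry whose container is in Waiting."""
    state = container_status.get("state") or {}
    waiting = state.get("waiting")
    if not waiting:
        # Running or terminated containers are outside this sketch.
        return UNKNOWN
    last = (container_status.get("lastState") or {}).get("terminated") or {}
    # Check OOM before the waiting reason: a container that keeps getting
    # OOM-killed typically sits in CrashLoopBackOff between restart attempts.
    if last.get("reason") == "OOMKilled" and last.get("exitCode") == 137:
        return "OOMKilled"
    # Anything not in the table is handed back rather than guessed at.
    return WAITING_RULES.get(waiting.get("reason"), UNKNOWN)
```

Every branch is a field comparison, so a reviewer can check each rule against describe output by eye, and an unrecognised reason falls through to Unknown instead of a guess.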

Refine

"Add a 'confidence' field to each classification. High confidence when the kubelet Reason field is unambiguous (e.g., OOMKilled, exit code 137). Low confidence when the evidence is suggestive but not definitive (e.g., pod is CrashLoopBackOff but last-exit logs are empty — could be a race, could be a quick-bail, could be a signal). Low confidence means the skill ends with 'human review recommended' and prints the evidence verbatim rather than asserting a root cause."
"Remember adversarial scenario 4 from the task lesson — debugging a pod that's actually healthy. Add a Healthy classification with a rationale template: 'all containers Running, Conditions show Ready=True, no recent restarts, recent Events all Normal'. The skill must resist the temptation to find a problem when there isn't one."
"Add a Known failures outside this skill's scope section that explicitly names: network policy rejection (pod Running but traffic denied), admission webhook rejection at apply-time, scheduling failures (Pending due to resource pressure / no matching node), PVC binding failures, and CrashLoopBackOff with a custom init-container. For each, give one diagnostic command the user should run next. The skill hands these back to a human — it does NOT try to diagnose them."

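The confidence rule from the first refinement prompt could look roughly like the sketch below. TriageResult and crashloop_result are hypothetical names, and treating a non-empty previous log tail as sufficient for High confidence is a simplifying assumption; a real skill may want more signals.

```python
from dataclasses import dataclass, field


@dataclass
class TriageResult:
    classification: str
    confidence: str  # "High" or "Low"
    evidence: list = field(default_factory=list)

    def render(self) -> str:
        lines = [
            f"Classification: {self.classification}",
            f"Confidence: {self.confidence}",
        ]
        # Evidence is printed verbatim so the user can verify the call.
        lines += self.evidence
        if self.confidence == "Low":
            lines.append("human review recommended")
        return "\n".join(lines)


def crashloop_result(last_state_block: str, previous_log_tail: str) -> TriageResult:
    """CrashLoopBackOff with empty last-exit logs is only suggestive."""
    confidence = "High" if previous_log_tail.strip() else "Low"
    evidence = [
        last_state_block,
        previous_log_tail or "(logs --previous returned nothing)",
    ]
    return TriageResult("CrashLoopBackOff", confidence, evidence)
```

A Low result never asserts a root cause: it prints what was found and ends with the review line.
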
Validate in a fresh context — happy path AND adversarial

4a — Happy path

New Claude session. Use scenarios 1–3 from the task lesson:

"Read .claude/skills/k8s-pod-debug-triage/SKILL.md and triage the pod in namespace demo with selector app=crash." (Same for app=badimage, app=oom.)

Each run: classification correct, confidence High, evidence shown. If any produces the wrong classification — even once — the skill isn't safe. Fix.

Also run against a healthy pod:

"Triage the pod in namespace demo with selector app=podinfo (which is healthy)."

Expected output: Healthy, with the rationale template. If the skill hallucinates a problem, it has failed the honest-to-its-limits principle and must be fixed.

4b — Adversarial

This skill's adversarial scenarios are the failures most likely to happen in real life that are not among the original three canonical failures:

Pod with a NetworkPolicy blocking egress, stuck failing health probes. Deploy a pod that starts fine but can't reach its upstream dependency. describe shows Running + readiness-probe failures. Does the skill correctly classify as Unknown — human required and surface the readiness-probe failures + point the user at kubectl get networkpolicies -n <ns>, or does it confidently misclassify as CrashLoopBackOff?
Admission webhook rejected the Deployment. The Deployment exists but no Pods are being created. kubectl get pods -l app=rejected -n <ns> returns nothing. Does the skill detect "no pods matching selector" and either surface the Deployment's events or hand back to human, or does it hallucinate output?
Pod is fine now but was OOMKilled five minutes ago (one restart, now healthy). Does the skill notice the restart count is 1, surface the previous logs with the OOM signature, and correctly classify as Healthy — with recent OOM event (review memory limits) rather than Healthy flat?

Each of these is a case where a misfire sends the user down the wrong path. Fix any misclassification in section 3 (Refine) before shipping the skill.
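
The three adversarial outcomes can be sketched as guards that run before any Healthy verdict. This is a minimal illustration, assuming the input is the items list from kubectl get pods -o json; the function name is hypothetical, and the sketch ignores how recent the restart was. It also takes the conservative route of handing back any restart that is not an OOM kill.

```python
# Label texts taken from the lesson; \u2014 is the dash used in those labels.
UNKNOWN = "Unknown \u2014 human required"
HEALTHY = "Healthy"
HEALTHY_RECENT_OOM = "Healthy \u2014 with recent OOM event (review memory limits)"


def classify_running_pods(pods: list) -> str:
    """pods: the "items" list from `kubectl get pods -o json` (assumed input)."""
    if not pods:
        # Selector matched nothing: never invent output for pods that don't exist.
        return UNKNOWN
    saw_oom = False
    for pod in pods:
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        if not statuses:
            # No container statuses yet (e.g. still Pending): not provably healthy.
            return UNKNOWN
        for cs in statuses:
            running = "running" in (cs.get("state") or {})
            if not (running and cs.get("ready")):
                # Covers Running-but-not-Ready, e.g. failing readiness probes.
                return UNKNOWN
            if cs.get("restartCount", 0) > 0:
                last = (cs.get("lastState") or {}).get("terminated") or {}
                if last.get("reason") != "OOMKilled":
                    # Restarted for some other reason: hand back to a human.
                    return UNKNOWN
                saw_oom = True
    return HEALTHY_RECENT_OOM if saw_oom else HEALTHY
```

Note that plain Healthy is only reachable when every container is running, ready, and has never restarted.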

Promote deterministic commands to `scripts/`

The kubectl invocations are deterministic. The classification is judgement and stays in prose / skill logic.

"Extract the evidence-gathering into .claude/skills/k8s-pod-debug-triage/scripts/gather.sh (args: namespace, selector-or-name). Output a structured JSON blob containing: pod name, phase, container statuses (including waiting.reason and lastState.terminated.{reason, exitCode}), recent events (last 5), current log tail (50 lines), previous log tail (50 lines). The skill's classification logic then reads this JSON — decoupling 'what did we find' from 'what does it mean'."

The JSON-structuring matters: it means the classification logic is testable without a live cluster, and you can add new classifications later without re-writing the evidence-gathering.
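
As an illustration of "testable without a live cluster", the sketch below feeds a hand-written fixture to one classification check. The top-level key names (pod, phase, containerStatuses, events, logTail, previousLogTail) are assumptions about what gather.sh might emit, and the pod name is made up; only the nested state and lastState fields mirror the real pod status schema.

```python
import json

# Hypothetical gather.sh output for an OOM-killed pod; top-level keys are assumed.
OOM_FIXTURE = """
{
  "pod": "oom-example",
  "phase": "Running",
  "containerStatuses": [
    {
      "name": "app",
      "restartCount": 3,
      "state": {"waiting": {"reason": "CrashLoopBackOff"}},
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
    }
  ],
  "events": [],
  "logTail": "",
  "previousLogTail": ""
}
"""


def is_oom_killed(evidence: dict) -> bool:
    """True if any container's last termination was an OOM kill (exit 137)."""
    for cs in evidence.get("containerStatuses", []):
        last = (cs.get("lastState") or {}).get("terminated") or {}
        if last.get("reason") == "OOMKilled" and last.get("exitCode") == 137:
            return True
    return False


def test_oom_fixture() -> None:
    # No kubectl call anywhere: the fixture stands in for the cluster.
    assert is_oom_killed(json.loads(OOM_FIXTURE))


test_oom_fixture()
```

Saving one such fixture per scenario from the validation step gives a regression suite that can run on every change to the classification rules.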

Know the boundary

This is the section you must get exactly right. At the top of .claude/skills/k8s-pod-debug-triage/SKILL.md:

This skill handles: CrashLoopBackOff (with exit-code evidence), ImagePullBackOff / ErrImagePull, OOMKilled (exit 137), CreateContainerConfigError, healthy pods (explicit Healthy classification), and healthy-with-recent-restart (surfaces the previous terminated event without false alarm). Classification + evidence only — no auto-fix.
This skill does NOT handle: NetworkPolicy-induced traffic failures (pod Running but blocked), admission-webhook rejections (no pods created), scheduling failures (Pending / no matching node), PVC binding failures, init-container failures, sidecar-caused restarts, cloud-LB / Ingress routing issues, or anything where the correct next step is a cluster-wide diagnostic (kubectl get events -A, kubectl describe node). In all these cases the skill classifies as Unknown — human required, prints the evidence, and suggests the next command — it does NOT guess.
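
One way to keep the hand-back honest is a fixed lookup from out-of-scope category to a single next command, sketched below. The category keys and the function name are hypothetical, and the commands are suggestions rather than the lesson's prescribed list; only kubectl get networkpolicies, kubectl get events -A, and kubectl describe node appear in the text above.

```python
# Label text taken from the lesson; \u2014 is the dash used in that label.
UNKNOWN = "Unknown \u2014 human required"

# Hypothetical category keys -> one suggested next command each.
NEXT_STEP = {
    "network-policy": "kubectl get networkpolicies -n {ns}",
    "admission-webhook": "kubectl get events -n {ns} --sort-by=.lastTimestamp",
    "scheduling": "kubectl describe node",
    "pvc-binding": "kubectl get pvc -n {ns}",
    "init-container": "kubectl logs {pod} -c {container} -n {ns}",
}


def hand_back(category: str, **params: str) -> str:
    """Return the Unknown verdict plus one next command; never a diagnosis."""
    # Unrecognised categories fall back to the cluster-wide event listing.
    template = NEXT_STEP.get(category, "kubectl get events -A")
    return f"{UNKNOWN}. Suggested next command: {template.format(**params)}"
```

For example, hand_back("pvc-binding", ns="demo") ends with kubectl get pvc -n demo. The function only ever returns the Unknown label plus a command, so there is no code path through which it could assert a root cause.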

You're done when

A fresh Claude session correctly classifies all three canonical failures + the healthy pod AND correctly classifies all three adversarial scenarios (NetworkPolicy, admission webhook, recent-OOM-now-healthy) as either the correct class or Unknown — human required. An honest "I don't know" is a feature, not a failure.