learnclaude.dev

Free course

Kubernetes Operators, Istio, Incident Response & AI

Build a Kubernetes operator with Claude Code, then use it to manage an AI workload, embed AI inside the reconciler itself, and wire it into an incident-response flow on a real service mesh — so the operator is both the managed system and a first responder.

Level: Intermediate
Time: 320 min
Prerequisite: Comfort reading YAML and running kubectl against a running cluster. No prior operator-authoring experience needed.
Lessons: 16

About this course

The hands-on lab runs on kind + Istio with the Bookinfo sample app, giving you a multi-service mesh you can break on purpose and observe end to end. The focus is what the combination of operator + service mesh + AI + Claude Code unlocks; this is not a generic Kubernetes tutorial.

What you'll learn

  • Explain the operator pattern and when it's the right answer (and when a plain controller isn't enough).
  • Identify the four Kubernetes nouns an operator composes — CRDs, controllers, reconcile loops, and the API server contract.
  • Bootstrap a multi-node kind cluster with Istio and the Bookinfo sample so later lessons have a realistic mesh to break and reason about.
  • Tear the lab down cleanly between sessions — cluster, port-forward, and Docker Desktop — without leaving processes or VMs running in the background.
  • Spot the unique leverage Claude Code gives when authoring an operator vs. reading one.
  • Understand how an operator can both manage an AI workload and embed AI inside the reconcile loop for incident response.
  • Design a provider-pluggable LLM interface in Go and implement it against local Ollama/Gemma and the Anthropic Claude API — with classified errors, ctx-respecting cancellation, and skill files validated in fresh Claude Code sessions.
  • Explain why a Skill belongs in the cluster as a first-class CRD, and how the safety triangle (RBAC + tool whitelist + output filter) bounds what the AI inside a controller is permitted to do.
  • Design a read-only investigation CRD pair (Skill + Investigation), its RBAC envelope, and its trigger-mode ladder — and pressure-test the design with an adversarial probe that walks malicious skill prose through each layer of the safety triangle.
  • Write a domain-specific RootSync investigation skill that's useful on a developer's laptop today AND slots into the in-cluster controller when it ships — one artefact, two homes.
  • Draw the discipline between deterministic controller code and AI synthesis — when the LLM is invoked, what it receives, what it returns, and what the controller still owns regardless of what the LLM said. Most investigations resolve without calling the LLM; the ones that don't are where the org-specific Skill prose earns its keep.
  • Package the whole investigation stack — CRDs, RBAC, default Skill, controller Deployment — as a single kro ResourceGraphDefinition that platform users install with one kubectl apply, while keeping the AI-in-loop reconcile out of kro (where CEL can't reach) and in the Go controller (where it belongs).

Syllabus

320 min total

Skills you'll build

Four reusable Claude Code skills. Each is validated in a fresh session against both valid and adversarial input, and each documents what it handles as well as what it does not.

  • .claude/skills/operator-investigation-design/
  • .claude/skills/rootsync-investigation/
  • .claude/skills/ai-invocation-discipline/
  • .claude/skills/kro-platform-packaging/

What each skill handles (4 boundary statements)

  1. .claude/skills/operator-investigation-design/

    Handles: designing read-only Investigation CRDs paired with a Skill CRD, the RBAC envelope for the controller, schema-level CEL validation, the trigger-mode ladder, and the adversarial probe that pressure-tests the safety triangle for any new Investigation kind.
    Does NOT handle: action/remediation CRDs, which need a propose-then-approve pattern, structured-action outputs constrained at the AI tool-call boundary, and a human-approval CRD pair before any destructive verb can fire.

  2. .claude/skills/rootsync-investigation/

    Handles: investigating RootSync errors in ACM/Config Sync (≥1.16), mapping KNV1067/1068/2009/2013/2014 to root causes specific to your org's admission webhooks, sync ordering, and shared-dependency failure modes, with output discipline that produces git-actionable suggestions only.
    Does NOT handle: RepoSync (namespace-scoped) investigation, Argo CD / Flux, remediation actions, or sync engines older than Config Sync 1.16.

  3. .claude/skills/ai-invocation-discipline/

    Handles: designing the deterministic-vs-AI boundary inside a Kubernetes controller's reconcile loop, namely the EvidencePack contract, the gating predicate that decides whether to call the LLM at all, the structured AI output type, and the validation + filter step before any .status write.
    Does NOT handle: the controller framework wiring, the LLM client itself (covered in L08–10), CRD design (L12), or skill prose authoring (L13). This skill is the discipline for connecting those pieces.

  4. .claude/skills/kro-platform-packaging/

    Handles: packaging an operator stack (controller Deployment, ServiceAccount, RBAC, default CRs, namespace) as a kro ResourceGraphDefinition with SimpleSchema for inputs and CEL for inter-resource references. Includes the three failure-mode probes every kro stack should design around (in-flight upgrade idempotency, instance-create RBAC, schema evolution).
    Does NOT handle: the controller's reconcile logic (that stays in Go), or anything requiring HTTP, LLM calls, or arbitrary code from inside reconciliation (CEL forbids it).
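
To make the deterministic-vs-AI boundary from the ai-invocation-discipline statement more concrete, here is a minimal sketch under stated assumptions. Only the name EvidencePack comes from the course text. Its fields, the `Finding` type, `NeedsLLM`, `FilterFinding`, `Resolve`, and the forbidden-phrase list are all hypothetical, and the cause strings are placeholders rather than real KNV meanings. The idea shown is that the controller answers from its own table when it can, calls the model only when the gating predicate says so, and validates whatever comes back before it would be written anywhere.

```go
package main

import (
	"fmt"
	"strings"
)

// EvidencePack is the only input the model would receive in this sketch.
// The field names are illustrative assumptions.
type EvidencePack struct {
	ErrorCodes  []string          // codes observed on the resource
	KnownCauses map[string]string // deterministic code-to-cause table
}

// Finding is a hypothetical structured output type.
type Finding struct {
	Summary    string
	Suggestion string
}

// NeedsLLM is the gating predicate: the model is called only when at
// least one observed code has no deterministic explanation.
func NeedsLLM(p EvidencePack) bool {
	for _, c := range p.ErrorCodes {
		if _, ok := p.KnownCauses[c]; !ok {
			return true
		}
	}
	return false
}

// forbiddenPhrases is an illustrative output filter for a read-only
// investigation: suggestions that mutate the cluster are rejected.
var forbiddenPhrases = []string{"kubectl delete", "kubectl apply", "kubectl patch"}

// FilterFinding validates model output before the controller uses it.
func FilterFinding(f Finding) (Finding, error) {
	if strings.TrimSpace(f.Summary) == "" {
		return Finding{}, fmt.Errorf("finding rejected: empty summary")
	}
	text := strings.ToLower(f.Summary + " " + f.Suggestion)
	for _, ph := range forbiddenPhrases {
		if strings.Contains(text, ph) {
			return Finding{}, fmt.Errorf("finding rejected: contains %q", ph)
		}
	}
	return f, nil
}

// Resolve takes the deterministic path when possible; otherwise it calls
// ask (a stand-in for the LLM client) and filters the result.
func Resolve(p EvidencePack, ask func(EvidencePack) Finding) (Finding, error) {
	if !NeedsLLM(p) {
		var causes []string
		for _, c := range p.ErrorCodes {
			causes = append(causes, c+": "+p.KnownCauses[c])
		}
		return Finding{Summary: strings.Join(causes, "; ")}, nil
	}
	return FilterFinding(ask(p))
}

func main() {
	known := map[string]string{"KNV2009": "placeholder cause from the org's table"}
	// A misbehaving stub model that suggests a destructive command.
	stub := func(EvidencePack) Finding {
		return Finding{Summary: "looks transient", Suggestion: "run kubectl delete on the pod"}
	}

	// Known code: resolved without calling the model.
	f, err := Resolve(EvidencePack{ErrorCodes: []string{"KNV2009"}, KnownCauses: known}, stub)
	fmt.Println(f.Summary, err)

	// Unknown code: the model is called and its output is filtered.
	_, err = Resolve(EvidencePack{ErrorCodes: []string{"KNV1067"}, KnownCauses: known}, stub)
	fmt.Println(err)
}
```

A substring filter like this is deliberately crude. It stands in for whatever validation a real controller would apply, and it is not meant as a sufficient safety layer on its own.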