Free course

Kubernetes Operators, Istio, Incident Response & AI

Build a Kubernetes operator with Claude Code, then use it to manage an AI workload, embed AI inside the reconciler itself, and wire it into an incident-response flow on a real service mesh — so the operator is both the managed system and a first responder.

Level: Intermediate
Time: 320 min
Prerequisite: Comfort reading YAML and running kubectl against a running cluster. No prior operator-authoring experience needed.
Lessons: 16

About this course

The hands-on lab runs on kind + Istio with the Bookinfo sample app, giving you a multi-service mesh you can break on purpose and observe end to end. The focus is on what the combination of operator + service mesh + AI + Claude Code unlocks; this is not a generic Kubernetes tutorial.

What you'll learn

Explain the operator pattern and when it's the right answer (and when a plain controller isn't enough).
Identify the four Kubernetes nouns an operator composes — CRDs, controllers, reconcile loops, and the API server contract.
Bootstrap a multi-node kind cluster with Istio and the Bookinfo sample so later lessons have a realistic mesh to break and reason about.
Tear the lab down cleanly between sessions — cluster, port-forward, and Docker Desktop — without leaving processes or VMs running in the background.
Spot the unique leverage Claude Code gives when authoring an operator vs. reading one.
Understand how an operator can both manage an AI workload and embed AI inside the reconcile loop for incident response.
Design a provider-pluggable LLM interface in Go and implement it against local Ollama/Gemma and the Anthropic Claude API — with classified errors, ctx-respecting cancellation, and skill files validated in fresh Claude Code sessions.
Explain why a Skill belongs in the cluster as a first-class CRD, and how the safety triangle (RBAC + tool whitelist + output filter) bounds what the AI inside a controller is permitted to do.
Design a read-only investigation CRD pair (Skill + Investigation), its RBAC envelope, and its trigger-mode ladder — and pressure-test the design with an adversarial probe that walks malicious skill prose through each layer of the safety triangle.
Write a domain-specific RootSync investigation skill that's useful on a developer's laptop today AND slots into the in-cluster controller when it ships — one artefact, two homes.
Draw the line between deterministic controller code and AI synthesis: when the LLM is invoked, what it receives, what it returns, and what the controller still owns regardless of what the LLM said. Most investigations resolve without calling the LLM; the ones that don't are where the org-specific Skill prose earns its keep.
Package the whole investigation stack — CRDs, RBAC, default Skill, controller Deployment — as a single kro ResourceGraphDefinition that platform users install with one kubectl apply, while keeping the AI-in-loop reconcile out of kro (where CEL can't reach) and in the Go controller (where it belongs).

Syllabus

320m total

Skills you'll build

4 reusable Claude Code skills. Each is validated in a fresh session against both valid and adversarial input, and documents what it handles as well as what it does not.

.claude/skills/operator-investigation-design/
.claude/skills/rootsync-investigation/
.claude/skills/ai-invocation-discipline/
.claude/skills/kro-platform-packaging/

What each skill handles (4 boundary statements)

.claude/skills/operator-investigation-design/

Handles: designing read-only Investigation CRDs paired with a Skill CRD, the RBAC envelope for the controller, schema-level CEL validation, the trigger-mode ladder, and the adversarial probe that pressure-tests the safety triangle for any new Investigation kind. Does NOT handle: action/remediation CRDs, which need a propose-then-approve pattern, structured-action outputs constrained at the AI tool-call boundary, and a human-approval CRD pair before any destructive verb can fire.
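
One leg of the safety triangle, the tool whitelist, can be sketched in a few lines of Go. The ToolCall type, the verb set, and the resource names below are assumptions made for illustration only; in a real controller the same limits would also be enforced by RBAC on the ServiceAccount, so this check is a second layer rather than the only one.

```go
package main

import "fmt"

// ToolCall is a hypothetical description of what the AI asks the
// controller to do on its behalf.
type ToolCall struct {
	Verb     string
	Resource string
}

// readOnlyVerbs is the verb allowlist for a read-only investigation.
var readOnlyVerbs = map[string]bool{"get": true, "list": true, "watch": true}

// Allowed reports whether a call uses a read-only verb on a resource
// that the Skill explicitly permits.
func Allowed(call ToolCall, resources map[string]bool) bool {
	return readOnlyVerbs[call.Verb] && resources[call.Resource]
}

func main() {
	// Illustrative resource allowlist; the names are assumptions.
	permitted := map[string]bool{"rootsyncs": true, "events": true}

	calls := []ToolCall{
		{Verb: "get", Resource: "rootsyncs"},
		{Verb: "delete", Resource: "rootsyncs"},
		{Verb: "get", Resource: "secrets"},
	}
	for _, c := range calls {
		fmt.Printf("%s %s allowed=%v\n", c.Verb, c.Resource, Allowed(c, permitted))
	}
}
```

Anything not on both lists is denied by default, which is the property an adversarial probe would try to break.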
.claude/skills/rootsync-investigation/

Handles: investigating ACM/Config Sync RootSync errors (≥1.16) — mapping KNV1067/1068/2009/2013/2014 to root causes specific to your org's admission webhooks, sync ordering, and shared-dependency failure modes; output discipline that produces git-actionable suggestions only. Does NOT handle: RepoSync (namespace-scoped) investigation, Argo CD / Flux, remediation actions, or sync engines older than Config Sync 1.16.
.claude/skills/ai-invocation-discipline/

Handles: designing the deterministic-vs-AI boundary inside a Kubernetes controller's reconcile loop: the EvidencePack contract, the gating predicate that decides whether to call the LLM at all, the structured AI output type, and the validation + filter step before any .status write. Does NOT handle: the controller framework wiring, the LLM client itself (covered in L08–10), CRD design (L12), or skill prose authoring (L13). This skill is the discipline for connecting them.
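
The boundary described above might look roughly like the following Go sketch. Only the name EvidencePack comes from the course text; its fields, the Finding type, ShouldInvokeLLM, Validate, and the forbidden-command list are all hypothetical, chosen to show the shape of a gating predicate and an output filter rather than the course's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// EvidencePack is what the deterministic code gathers before any LLM
// call. The fields are assumptions for this sketch.
type EvidencePack struct {
	ErrorCodes []string
	KnownCause string // set by deterministic rules; empty if none matched
}

// Finding is a hypothetical structured AI output type.
type Finding struct {
	Summary    string
	Suggestion string
	Confidence float64
}

// ShouldInvokeLLM is the gating predicate: call the model only when
// there are errors and no deterministic rule already explained them.
func ShouldInvokeLLM(e EvidencePack) bool {
	return e.KnownCause == "" && len(e.ErrorCodes) > 0
}

// forbidden lists command fragments a read-only investigation must
// never suggest (illustrative, not exhaustive).
var forbidden = []string{"kubectl delete", "kubectl apply", "kubectl patch"}

// Validate rejects malformed or unsafe findings before the controller
// writes anything to .status.
func Validate(f Finding) (Finding, error) {
	if strings.TrimSpace(f.Summary) == "" {
		return Finding{}, errors.New("empty summary")
	}
	if f.Confidence < 0 || f.Confidence > 1 {
		return Finding{}, errors.New("confidence out of range")
	}
	lower := strings.ToLower(f.Suggestion)
	for _, frag := range forbidden {
		if strings.Contains(lower, frag) {
			return Finding{}, fmt.Errorf("suggestion contains forbidden command %q", frag)
		}
	}
	return f, nil
}

func main() {
	resolved := EvidencePack{ErrorCodes: []string{"KNV2009"}, KnownCause: "webhook timeout"}
	unresolved := EvidencePack{ErrorCodes: []string{"KNV2009"}}
	fmt.Println(ShouldInvokeLLM(resolved), ShouldInvokeLLM(unresolved))

	_, err := Validate(Finding{Summary: "s", Suggestion: "run kubectl delete ns foo", Confidence: 0.9})
	fmt.Println(err != nil)
}
```

The controller owns both functions: the model never decides whether it is called, and nothing it returns reaches .status without passing Validate.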
.claude/skills/kro-platform-packaging/

Handles: packaging an operator stack (controller Deployment, ServiceAccount, RBAC, default CRs, namespace) as a kro ResourceGraphDefinition with SimpleSchema for inputs and CEL for inter-resource references; includes the three failure-mode probes every kro stack should design around (in-flight upgrade idempotency, instance-create RBAC, schema evolution). Does NOT handle: the controller's reconcile logic, which stays in Go, or anything requiring HTTP, LLM calls, or arbitrary code from inside reconciliation, which CEL cannot express.