Free course
Kubernetes Operators, Istio, Incident Response & AI
Build a Kubernetes operator with Claude Code, then use it to manage an AI workload, embed AI inside the reconciler itself, and wire it into an incident-response flow on a real service mesh — so the operator is both the managed system and a first responder.
- Level: Intermediate
- Time: 320 min
- Prerequisite: Comfort reading YAML and running kubectl against a running cluster. No prior operator-authoring experience needed.
- Lessons: 16
About this course
The hands-on lab runs on kind + Istio with the Bookinfo sample app, giving you a multi-service mesh you can break on purpose and observe end to end. The focus is on what the combination of operator + service mesh + AI + Claude Code unlocks; this is not a generic Kubernetes tutorial.
What you'll learn
- Explain the operator pattern and when it's the right answer (and when a plain controller isn't enough).
- Identify the four Kubernetes nouns an operator composes — CRDs, controllers, reconcile loops, and the API server contract.
- Bootstrap a multi-node kind cluster with Istio and the Bookinfo sample so later lessons have a realistic mesh to break and reason about.
- Tear the lab down cleanly between sessions — cluster, port-forward, and Docker Desktop — without leaving processes or VMs running in the background.
- Spot the unique leverage Claude Code gives when authoring an operator vs. reading one.
- Understand how an operator can both manage an AI workload and embed AI inside the reconcile loop for incident response.
- Design a provider-pluggable LLM interface in Go and implement it against local Ollama/Gemma and the Anthropic Claude API — with classified errors, ctx-respecting cancellation, and skill files validated in fresh Claude Code sessions.
- Explain why a Skill belongs in the cluster as a first-class CRD, and how the safety triangle (RBAC + tool whitelist + output filter) bounds what the AI inside a controller is permitted to do.
- Design a read-only investigation CRD pair (Skill + Investigation), its RBAC envelope, and its trigger-mode ladder — and pressure-test the design with an adversarial probe that walks malicious skill prose through each layer of the safety triangle.
- Write a domain-specific RootSync investigation skill that's useful on a developer's laptop today AND slots into the in-cluster controller when it ships — one artefact, two homes.
- Draw the discipline between deterministic controller code and AI synthesis — when the LLM is invoked, what it receives, what it returns, and what the controller still owns regardless of what the LLM said. Most investigations resolve without calling the LLM; the ones that don't are where the org-specific Skill prose earns its keep.
- Package the whole investigation stack (CRDs, RBAC, default Skill, controller Deployment) as a single kro ResourceGraphDefinition that platform users install with one kubectl apply, while keeping the AI-in-loop reconcile out of kro (where CEL can't reach) and in the Go controller (where it belongs).
Syllabus
320m total
- Why operators + AI 7m
- What is a Kubernetes operator? 25m
- Foundations check 13m
- Lab prerequisites 13m
- Bootstrap the lab cluster 10m
- Deploy Bookinfo on the mesh 19m
- Tear down the lab 6m
- Design the LLM interface contract 25m
- Local Gemma adapter via Ollama 25m
- Cloud adapter — Anthropic Claude API 25m
- Skills as cluster-native domain knowledge 25m
- Design the RootSync investigation surface 30m
- Write your team's RootSync investigation skill 30m
- Investigation architecture recall 12m
- AI as a scalpel — when the controller calls the LLM, and when it doesn't 25m
- Package the investigation stack with kro 30m
Skills you'll build
4 reusable Claude Code skills. Each is validated in a fresh session against both valid and adversarial input, and documents what it handles as well as what it does not.
- .claude/skills/operator-investigation-design/
- .claude/skills/rootsync-investigation/
- .claude/skills/ai-invocation-discipline/
- .claude/skills/kro-platform-packaging/
What each skill handles
- .claude/skills/operator-investigation-design/
  Handles: designing read-only Investigation CRDs paired with a Skill CRD, the RBAC envelope for the controller, schema-level CEL validation, the trigger-mode ladder, and the adversarial probe that pressure-tests the safety triangle for any new Investigation kind. Does NOT handle: action/remediation CRDs, which need a propose-then-approve pattern, structured-action outputs constrained at the AI tool-call boundary, and a human-approval CRD pair before any destructive verb can fire.
- .claude/skills/rootsync-investigation/
  Handles: investigating ACM/Config Sync RootSync errors (≥1.16), including mapping KNV1067/1068/2009/2013/2014 to root causes specific to your org's admission webhooks, sync ordering, and shared-dependency failure modes, with output discipline that produces git-actionable suggestions only. Does NOT handle: RepoSync (namespace-scoped) investigation, Argo CD / Flux, remediation actions, or sync engines older than Config Sync 1.16.
- .claude/skills/ai-invocation-discipline/
  Handles: designing the deterministic-vs-AI boundary inside a Kubernetes controller's reconcile loop: the EvidencePack contract, the gating predicate that decides whether to call the LLM at all, the structured AI output type, and the validation + filter step before any .status write. Does NOT handle: the controller framework wiring, the LLM client itself (covered in L08–10), CRD design (L12), or skill prose authoring (L13). It is a discipline for connecting them.
- .claude/skills/kro-platform-packaging/
  Handles: packaging an operator stack (controller Deployment, ServiceAccount, RBAC, default CRs, namespace) as a kro ResourceGraphDefinition with SimpleSchema for inputs and CEL for inter-resource references; includes the three failure-mode probes every kro stack should design around (in-flight upgrade idempotency, instance-create RBAC, schema evolution). Does NOT handle: the controller's reconcile logic, which stays in Go, or anything requiring HTTP, LLM calls, or arbitrary code from inside reconciliation, which CEL forbids.
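As an illustration of the ai-invocation-discipline boundary described above, the following Go sketch shows one way a gating predicate and an output filter could sit on either side of an LLM call. Only the concept names (EvidencePack, gating predicate, structured output, filter before the .status write) come from the skill description. Every field, function name, and rule here is an assumption made for the example, including the forbidden-phrase list and the placeholder code KNV9999.

```go
package main

import (
	"fmt"
	"strings"
)

// EvidencePack is what the deterministic code has gathered so far.
// The fields are hypothetical; the course's real contract may differ.
type EvidencePack struct {
	ErrorCodes  []string          // codes observed on the resource
	KnownCauses map[string]string // deterministic code -> cause table
}

// ShouldCallLLM is the gating predicate in this sketch: the LLM is
// invoked only if some observed code has no deterministic mapping.
func ShouldCallLLM(ev EvidencePack) bool {
	for _, code := range ev.ErrorCodes {
		if _, ok := ev.KnownCauses[code]; !ok {
			return true
		}
	}
	return false
}

// Finding is the structured AI output type (hypothetical shape).
type Finding struct {
	Summary    string
	Suggestion string
}

// forbidden is an assumed deny-list for this example: suggestions must
// be git-actionable, so imperative cluster mutations are rejected.
var forbidden = []string{"kubectl delete", "kubectl apply", "kubectl patch"}

// FilterFinding validates a Finding before the controller would copy
// it into .status; the controller owns this step regardless of what
// the LLM returned.
func FilterFinding(f Finding) (Finding, error) {
	if strings.TrimSpace(f.Summary) == "" {
		return Finding{}, fmt.Errorf("empty summary")
	}
	lower := strings.ToLower(f.Suggestion)
	for _, bad := range forbidden {
		if strings.Contains(lower, bad) {
			return Finding{}, fmt.Errorf("suggestion contains forbidden phrase %q", bad)
		}
	}
	return f, nil
}

func main() {
	ev := EvidencePack{
		// KNV9999 is a made-up placeholder for an unmapped code.
		ErrorCodes:  []string{"KNV2009", "KNV9999"},
		KnownCauses: map[string]string{"KNV2009": "cause documented by your team"},
	}
	fmt.Println("call LLM:", ShouldCallLLM(ev))

	_, err := FilterFinding(Finding{
		Summary:    "webhook timeout",
		Suggestion: "run kubectl delete pod on the reconciler",
	})
	fmt.Println("rejected:", err != nil)
}
```

A deny-list on free text is only a rough approximation of an output filter. It is used here to keep the example short, and a stricter design would likely validate against a structured schema instead.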