Lesson 16 of 16

Package the investigation stack with kro

You have the architecture (L11), the CRD pair (L12), and the skill prose (L13). Before we write a line of Go for the controller, we settle one operational question: how does this stack actually install into a cluster?

The naive answer is "a bag of YAMLs": a CRD file for Skill, a CRD file for RootSyncInvestigation, a ServiceAccount, a ClusterRole, a ClusterRoleBinding, maybe a default Skill CR, a Deployment for the controller, perhaps a ConfigMap. Six to ten files, an implicit order, a kustomize overlay per environment, a helm template for the variable bits. It works. It's also exactly the kind of toil that buries a platform team when the second Investigation kind ships and the install procedure has to be duplicated.

There is a better answer that's now a kubernetes-sigs project: kro — the Kube Resource Orchestrator. kro is currently in alpha (kro.run/v1alpha1); install via the OCI chart at registry.k8s.io/kro/charts/kro (shown in the install script later in this lesson), and production users should pin to a specific chart version rather than tracking latest. This lesson teaches just enough kro to package the Investigation stack as a single kubectl apply, while being honest about exactly where kro's responsibility ends and the AI-in-loop controller's begins.

What kro is, in one paragraph

kro introduces one custom resource called ResourceGraphDefinition (RGD). You declare three things inside an RGD: a SimpleSchema (the shape of a new user-facing CRD), a list of underlying resources (the YAML for everything that should be created when the user instantiates the new kind), and CEL expressions that template values between those resources and the user's instance. kro reads the RGD, generates the new CRD, and runs its own controller that watches instances of the new kind and reconciles the underlying resource graph into the cluster — handling dependency ordering, drift detection, and lifecycle management. The platform team writes one RGD; the platform user writes one instance YAML; everything else falls out.

Think of it as a templating operator built into the cluster, with type-safety, CEL-evaluated logic, and no Go code from you.

SimpleSchema and CEL — just enough to read the next example

kro's SimpleSchema is a compact way to declare a CRD's fields. The patterns you'll see in this lesson:

schema:
  apiVersion: v1alpha1
  kind: RootSyncInvestigationStack
  spec:
    namespace: string | default="rootsync-investigations"
    controllerImage: string
    defaultEnvironments: '[]string | default=["dev"]'
    enableDefaultSkill: boolean | default=true

Each line declares a field's name, a type (string, integer, boolean, '[]string', nested objects), and optional modifiers (default=..., required=true). The schema becomes the OpenAPI schema of the generated CRD — apply-time validation is free.
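As a small illustration of those modifiers, the fragment below marks one field required and nests an object. It is a sketch only: imagePullPolicy and resources are hypothetical fields that are not part of this lesson's RGD, and only the type names and the default= and required=true modifiers come from the description above.

```yaml
schema:
  apiVersion: v1alpha1
  kind: RootSyncInvestigationStack
  spec:
    # No sensible default exists, so the user must supply a value.
    controllerImage: string | required=true
    # Hypothetical field: instances that omit it get "IfNotPresent".
    imagePullPolicy: string | default="IfNotPresent"
    # Hypothetical nested object: indentation declares the structure.
    resources:
      cpu: string | default="100m"
      memory: string | default="128Mi"
```

A field marked required=true with no default is rejected at apply time when missing, which is the trade-off against the "every field has a default" style used in the rest of this lesson.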

CEL (Common Expression Language) is how kro templates values across the resource graph. You'll see two forms:

  • ${schema.spec.fieldName} — read a field from the user's instance spec
  • ${otherResource.spec.someField} — read a field from another underlying resource in the graph (kro figures out the dependency order from these references)
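The difference between the two forms matters for ordering, so here is a minimal sketch of both side by side. The resource ids ns and sa match the RGD later in this lesson; the ServiceAccount name example-controller is a placeholder.

```yaml
resources:
  - id: ns
    template:
      apiVersion: v1
      kind: Namespace
      metadata:
        # Instance reference: reads the user's spec, creates no edge.
        name: ${schema.spec.namespace}
  - id: sa
    template:
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: example-controller   # placeholder name
        # Resource reference: sa now depends on ns and is applied after it.
        namespace: ${ns.metadata.name}
```

Both expressions resolve to the same string here; only the second tells kro that one resource must exist before the other.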

CEL is non-Turing-complete by design: no unbounded loops (macros such as filter iterate only over finite lists), no I/O, no arbitrary code execution. That's the entire safety story of kro, and the entire boundary of what kro can do for you.

The honest boundary: what kro can NOT do

This is the part to get straight before writing any RGD. CEL cannot call an LLM. CEL cannot make HTTP requests. CEL cannot run arbitrary code. That's a feature, not a limitation — it's why an RGD is verifiable and safe — but it also means kro is fundamentally the wrong tool for the AI-in-loop reconcile we designed in L11–L13.

The packaging boundary is clean:

  • kro handles installation. The CRDs, the namespace, the ServiceAccount, the RBAC, the controller's Deployment, the default Skill CR. All of this is inert resource composition — CEL templating is exactly what it needs.
  • The Go controller handles runtime. Loading the Skill, talking to the LLM through the pkg/llm client from L08–10, fetching structured cluster inputs, writing findings back to .status. None of this is CEL-expressible.

Two layers, two tools. kro is not a replacement for an operator; it's a replacement for the boilerplate around an operator. The interesting bit — the AI reasoning — is still yours to write, which is why the rest of the course exists.

The RGD that packages the Investigation stack

Here is the RGD a platform team writes once and ships to every cluster that wants the investigation platform:

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: rootsync-investigation-stack
spec:
  schema:
    apiVersion: v1alpha1
    kind: RootSyncInvestigationStack
    spec:
      namespace: string | default="rootsync-investigations"
      controllerImage: string | default="ghcr.io/example/investigation-controller:v0.1.0"
      defaultEnvironments: '[]string | default=["dev"]'
      enableDefaultSkill: boolean | default=true
    status:
      controllerReady: ${deployment.status.conditions.filter(c, c.type == 'Available')[0].status}
      defaultSkillApplied: ${schema.spec.enableDefaultSkill}

  resources:
    - id: ns
      template:
        apiVersion: v1
        kind: Namespace
        metadata:
          name: ${schema.spec.namespace}

    - id: sa
      template:
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: rootsync-investigation-controller
          namespace: ${ns.metadata.name}   # referencing ns orders the SA after the Namespace

    - id: clusterRole
      template:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        metadata:
          name: rootsync-investigation-controller
        rules:
          - apiGroups: ["skills.learnclaude.dev"]
            resources: ["skills"]
            verbs: ["get", "list", "watch"]
          - apiGroups: ["investigations.learnclaude.dev"]
            resources: ["rootsyncinvestigations"]
            verbs: ["get", "list", "watch", "update", "patch"]
          - apiGroups: ["investigations.learnclaude.dev"]
            resources: ["rootsyncinvestigations/status"]
            verbs: ["update", "patch"]
          - apiGroups: ["configsync.gke.io"]
            resources: ["rootsyncs"]
            verbs: ["get", "list"]
          - apiGroups: [""]
            resources: ["pods", "pods/log"]
            verbs: ["get", "list"]
          - apiGroups: ["", "events.k8s.io"]
            resources: ["events"]
            verbs: ["get", "list"]
          - apiGroups: ["apps"]
            resources: ["deployments", "statefulsets", "daemonsets"]
            verbs: ["get", "list"]
          - apiGroups: [""]
            resources: ["services", "configmaps", "namespaces"]
            verbs: ["get", "list"]

    - id: clusterRoleBinding
      template:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: rootsync-investigation-controller
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: ${clusterRole.metadata.name}
        subjects:
          - kind: ServiceAccount
            name: ${sa.metadata.name}
            namespace: ${sa.metadata.namespace}

    - id: defaultSkill
      includeWhen:
        - ${schema.spec.enableDefaultSkill}
      template:
        apiVersion: skills.learnclaude.dev/v1alpha1
        kind: Skill
        metadata:
          name: rootsync-investigation
          # Skill is cluster-scoped (see L12) — no namespace.
        spec:
          # In production this body is loaded from a ConfigMap or sealed
          # in the RGD. For the lesson it's a placeholder pointing at L13.
          body: "(see lesson 13 — the org-specific investigation skill)"
          handles: |
            Read-only investigation of RootSync errors in our ACM/Config Sync
            setup. Output is git-actionable suggestions for a human.
          doesNotHandle: |
            RepoSync, Argo CD, Flux, remediation actions, sync engines older
            than Config Sync 1.16.
          targetCRDs: [RootSyncInvestigation]
          requiresConfirmation: false
          environments: ${schema.spec.defaultEnvironments}
          version: "2026-05-12"

    - id: deployment
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: rootsync-investigation-controller
          namespace: ${ns.metadata.name}   # referencing ns orders the Deployment after the Namespace
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: rootsync-investigation-controller
          template:
            metadata:
              labels:
                app: rootsync-investigation-controller
            spec:
              serviceAccountName: ${sa.metadata.name}
              containers:
                - name: controller
                  image: ${schema.spec.controllerImage}
                  args:
                    - --namespace=${schema.spec.namespace}
                  env:
                    - name: WATCH_NAMESPACE
                      value: ${schema.spec.namespace}

Six resources, one schema, one RGD. The platform team applies this once per cluster. kro generates the RootSyncInvestigationStack CRD and registers its own controller to reconcile instances of it.

Note one piece that does NOT belong in this RGD: the CRDs for Skill and RootSyncInvestigation themselves. Those are platform-level prerequisites — you install them once with kubectl apply -f crds/ (or via a separate kro RGD that ships the CRDs) before any RGD that references them can be applied. Mixing schema-installation and stack-installation in one RGD creates a bootstrap loop: kro reconciles the Skill CR before the Skill CRD exists, the Skill CR fails admission, the stack never converges. Keep them separate.

What the user does

The end user, or the platform team for each environment, writes one file:

apiVersion: kro.run/v1alpha1
kind: RootSyncInvestigationStack
metadata:
  name: production
  namespace: kube-system    # the instance itself; kro creates resources in spec.namespace
spec:
  namespace: investigations-prod
  controllerImage: ghcr.io/example/investigation-controller:v0.2.1
  defaultEnvironments: ["prod"]
  enableDefaultSkill: true

Save it as instances/prod.yaml and run kubectl apply -f instances/prod.yaml. kro then:

  1. Validates the instance against the SimpleSchema (typos in field names fail at apply time).
  2. Renders each resource template with the user's values substituted in via CEL.
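  3. Computes the dependency graph from the CEL references between resources (a ServiceAccount whose namespace is ${ns.metadata.name} comes after the Namespace; a ClusterRoleBinding that references the ClusterRole and the ServiceAccount comes after both). References to ${schema.spec...} read the instance only and add no edges.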
  4. Applies the resources in dependency order.
  5. Watches everything; reconciles drift.
  6. Surfaces aggregate status on .status.controllerReady so the user can kubectl get rootsyncinvestigationstack -n kube-system and see one health line.

Per-environment customisation — different default Skills for dev/staging/prod, different controller images, different namespaces — is one instance YAML each, not a kustomize tree.
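For comparison with the production instance, a dev instance might look like the sketch below. The namespace investigations-dev is an assumed name, and the omitted fields fall back to the defaults declared in the RGD's schema. Because the ClusterRole, ClusterRoleBinding, and default Skill have fixed cluster-scoped names in the RGD, this sketch assumes one instance per cluster.

```yaml
apiVersion: kro.run/v1alpha1
kind: RootSyncInvestigationStack
metadata:
  name: dev
  namespace: kube-system          # the instance itself lives here
spec:
  namespace: investigations-dev   # assumed name; kro creates resources here
  defaultEnvironments: ["dev"]
  # controllerImage and enableDefaultSkill are omitted on purpose:
  # the schema defaults from the RGD apply.
```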

Claude-guided task — author the RGD

Fresh directory. New repo. The RGD lives separately from the controller's Go code:

mkdir ~/operator-investigation-rgd && cd ~/operator-investigation-rgd
git init

Open Claude Code. Drive the work through it, but read every line before saving.

Ask Claude, in order:

  1. Draft the SimpleSchema. Start from the spec shape above. Push Claude to consider required=true for any field that has no sensible default. In this RGD every field has a default, so none qualifies, but explaining why each default exists surfaces design intent. Ask: "What's the implication of every field having a default for the install UX vs. for accidental misconfiguration?"
  2. Generate the resource list with full CEL references. Ask Claude to derive the dependency order from the CEL graph (Namespace → ServiceAccount → Deployment, ClusterRole → ClusterRoleBinding, and so on). Have it explain which references actually create edges: a reference to another resource's id, such as ${ns.metadata.name}, makes the referencing resource depend on ns, whereas ${schema.spec.namespace} only reads the instance and creates no edge between resources. The dependency derivation is the whole reason kro exists; if Claude can't articulate it, dig until it can.
  3. Add the includeWhen guard on the default Skill. Walk through what happens if enableDefaultSkill: false: the Skill resource is skipped entirely, no apply attempt, no stale-but-disabled resource. This is the only conditional in the RGD; everything else is unconditional.
  4. Status synthesis. The status.controllerReady field uses a CEL filter on the Deployment's conditions. Ask Claude "What happens to controllerReady if the Deployment hasn't created any conditions yet (fresh apply)?" The filter then yields an empty list, and indexing [0] on an empty list is an evaluation error in CEL, not a null. Find out how your kro version surfaces that on the instance and be deliberate about it, because the user's kubectl get output depends on it.
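One way to sidestep the empty-list case is CEL's standard exists() macro, which returns false for an empty list instead of requiring an index. This is a sketch, not the lesson's RGD: it turns controllerReady into a boolean rather than the "True"/"False" string, and it still assumes the Deployment reports a conditions list at all.

```yaml
spec:
  schema:
    status:
      # exists() is false for an empty conditions list, so no [0] index is needed.
      controllerReady: ${deployment.status.conditions.exists(c, c.type == 'Available' && c.status == 'True')}
```

Whether a boolean or the raw condition string reads better in kubectl get output is a design choice; the point is to pick one on purpose.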

Three specific questions to push back on Claude's first draft:

  • "Why are the Skill and RootSyncInvestigation CRDs themselves NOT in this RGD? Walk me through the bootstrap loop that would result if I added them."
  • "Could we express the AI-in-loop reconcile logic in CEL — say, 'if the RootSync has a KNV2009 error, look up the Skill and... ?' Where does that approach hit the wall? What does CEL physically not support?"
  • "The default Skill's body field is a literal placeholder here. In production, where should the prose actually live — inline in the RGD, in a ConfigMap mounted by the kro controller, in a separate Skill CR applied outside the RGD? What's the trade-off?"

That last question doesn't have one right answer. Different orgs land in different places — the lesson is to make the trade-off conscious, not to dictate it.

Project rule. Stay close to your problem. Read every CEL expression and ask: what other resource does this template depend on? Read every template: and ask: would I be comfortable applying this manually if kro vanished? If the answer is no, the RGD is hiding too much.

Adversarial probe — what kro can NOT save you from

Three failure modes the RGD-as-installer doesn't cover. Walk through each:

  1. An RGD upgrade while Investigations are in flight. You bump controllerImage from v0.2.1 to v0.3.0. kro patches the Deployment, which triggers a rolling update of the controller pod. In-flight Investigations whose reconcile was mid-LLM-call get... what? The old pod dies (SIGTERM, grace period elapses), the new pod starts, the new pod picks up the in-flight Investigation from its .status.phase = Running and either re-reconciles from scratch or sees stale partial state. kro doesn't know about Investigation lifecycle. It just rolls the Deployment. Designing the controller to be idempotent on its own CR is on YOU, not kro — same as for any operator. The probe makes it explicit.

  2. A user instance with a malicious controller image. Someone with permission to create RootSyncInvestigationStack instances sets controllerImage: attacker/exfil:latest. kro happily creates a Deployment with that image, running under your ServiceAccount with all its RBAC. The platform's security boundary is who can create instances of the RGD's kind, not the RGD itself. Mitigation: lock the instance kind to platform-team-only via Kubernetes RBAC (create on rootsyncinvestigationstacks.kro.run is restricted), or add a kro CEL validation that constrains controllerImage to a regex matching your container registry. The default RGD does neither, so write the validation deliberately if your threat model requires it.

  3. Schema drift. You add a new field auditWebhookUrl: string to the SimpleSchema in v2 of the RGD. Existing instances written against v1 don't have the field. kro's behaviour here depends on whether the field is required and whether you provide a default; the wrong combination silently breaks running instances. Schema evolution is a CRD problem, not a kro problem — kro inherits the same versioning constraints any CRD has, including the need for storage versions, conversion webhooks if the schema changes shape, and explicit defaults for backward compat.
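For the second probe, both mitigations can be sketched with built-in Kubernetes APIs alone. This is an alternative to kro-side validation, not the lesson's method. The names investigation-stack-admin and platform-team and the ghcr.io/example/ prefix are assumptions, and ValidatingAdmissionPolicy at admissionregistration.k8s.io/v1 needs Kubernetes 1.30 or later. RBAC is additive, so the RoleBinding only helps if no other binding grants the same verbs more widely.

```yaml
# Who may create or change stack instances (grant this to the platform team only).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: investigation-stack-admin        # hypothetical name
rules:
  - apiGroups: ["kro.run"]
    resources: ["rootsyncinvestigationstacks"]
    verbs: ["create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: investigation-stack-admin
  namespace: kube-system                 # where the instances live in this lesson
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: investigation-stack-admin
subjects:
  - kind: Group
    name: platform-team                  # hypothetical group
    apiGroup: rbac.authorization.k8s.io
---
# Reject instances whose controllerImage is outside the trusted registry.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: investigation-stack-image-registry
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["kro.run"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["rootsyncinvestigationstacks"]
  validations:
    # An omitted field passes: the RGD's own default image then applies.
    - expression: "!has(object.spec) || !has(object.spec.controllerImage) || object.spec.controllerImage.startsWith('ghcr.io/example/')"
      message: "controllerImage must come from ghcr.io/example/"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: investigation-stack-image-registry
spec:
  policyName: investigation-stack-image-registry
  validationActions: ["Deny"]
```

The RBAC half narrows who can reach the boundary; the admission policy half limits the damage if someone who can reach it makes a mistake.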

The point of the probe is the same as in L12: not to prove the design works, but to find the layer that has to bear weight beyond what kro provides. Those layers (the controller's idempotency, the Kubernetes RBAC on instance creation, the CRD schema's evolution discipline) are not optional just because kro hides them.

Codify as a skill — kro-platform-packaging

Open .claude/skills/kro-platform-packaging/SKILL.md. Have Claude write the first draft, then critique and tighten.

The skill should capture, at a META level (not RootSync-specific):

  • kro is for inert resource composition. Use it when the install is a graph of YAMLs templated from a small set of inputs. Don't use it when the runtime needs HTTP, an LLM, or arbitrary code — those go in a Go controller.
  • The two-tool boundary. kro packages installation; the custom operator handles runtime AI. Mixing them is the temptation; the discipline is keeping them separate.
  • CRDs your RGD references must exist first. Bootstrap-loop trap: never put a CRD and a CR of that CRD in the same RGD. Separate RGDs (or a kubectl apply step) for CRDs; the stack RGD assumes CRDs are pre-installed.
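  • The dependency graph comes from CEL references. Two resources are dependent iff one references the other via ${id.field}. If you need an ordering that no field reference implies (e.g. wait for a Job to complete before applying the rest), do not hope CEL inference catches it. As far as the documented kro.run/v1alpha1 resource fields go (id, template, readyWhen, includeWhen), there is no dependsOn: field, so check the version you pin; the portable approach is to add a real CEL reference to the upstream resource and gate that resource with readyWhen.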
  • Status synthesis with CEL filters. Use ${deployment.status.conditions.filter(c, c.type == 'Available')[0].status} to roll one Deployment field up to the stack's status. The filter's edge case (empty conditions list) needs explicit handling.
  • The three probe failure modes above: in-flight upgrade idempotency, instance-creator security boundary, schema evolution discipline. Every kro-packaged stack inherits these problems; designing around them is the platform team's job.
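The "wait for a Job" case from the dependency bullet can be sketched with a CEL reference plus readyWhen. Everything named migrate, the --migrate flag, and the example.dev/after-job annotation are hypothetical. How kro evaluates the readyWhen expression while status.succeeded is still absent is an assumption worth testing against the kro version you pin.

```yaml
resources:
  - id: migrate                          # hypothetical one-off Job
    readyWhen:
      # Considered ready only once at least one pod has succeeded.
      - ${migrate.status.succeeded > 0}
    template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: stack-migrate
        namespace: ${schema.spec.namespace}
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: migrate
                image: ${schema.spec.controllerImage}
                args: ["--migrate"]      # hypothetical flag
  - id: deployment
    template:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: rootsync-investigation-controller
        namespace: ${schema.spec.namespace}
        annotations:
          # This reference to migrate is what creates the edge.
          example.dev/after-job: ${migrate.metadata.name}
      # spec omitted here; it is unchanged from the stack RGD.
```

The annotation carries no runtime meaning; it exists only so the Deployment references the Job and is ordered after it.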

End the SKILL.md with the boundary statement:

This skill handles: packaging a Kubernetes operator stack (CRDs assumed pre-installed, plus its controller Deployment, ServiceAccount, RBAC, default CRs, and namespace) as a kro ResourceGraphDefinition with SimpleSchema for user inputs and CEL for inter-resource references. Includes the three failure-mode probes (upgrade idempotency, instance-create RBAC, schema evolution) every kro stack should design around.

This skill does NOT handle: the controller's reconcile logic — that's Go, not kro. Anything requiring HTTP, LLM calls, file I/O, or arbitrary code from within reconciliation — CEL forbids it. CRD lifecycle (installation, conversion webhooks, storage version) — kro inherits but doesn't solve. Helm migrations or Crossplane comparisons — out of scope here.

Validate in a fresh Claude Code session — close this one, start a new one with no history, point it at the skill, ask: "Package a CertificateInvestigationStack as a kro RGD. The controller is ghcr.io/example/cert-investigation-controller, RBAC needs read on cert-manager.io/certificates and cert-manager.io/certificaterequests plus events." Output should be a kro RGD shaped like the RootSync one — Namespace, SA, ClusterRole tailored to cert-manager kinds, ClusterRoleBinding, default Skill (optional via includeWhen), controller Deployment. RootSync-free. Same three-probe discussion in the lesson plan.

Adversarial validation: ask the fresh session to "add an HTTP webhook call to the RGD's reconcile path so that on every instance create we POST to an external service." The skill must refuse: kro can't do HTTP, period — the right place for that is the Go controller, not the RGD. A skill that quietly drifts into "well, you could ... " is broken. Tighten and re-test.

Promote deterministic commands to scripts/

#!/usr/bin/env bash
# scripts/install.sh: one-shot bootstrap for a fresh cluster
set -euo pipefail
# 1. Install kro itself (Helm chart published as an OCI artifact).
#    The chart is pinned when KRO_VERSION is set; set it for production.
helm install kro oci://registry.k8s.io/kro/charts/kro \
  --namespace kro-system --create-namespace \
  ${KRO_VERSION:+--version "${KRO_VERSION}"}
# 2. Install the platform CRDs (Skill + RootSyncInvestigation)
kubectl apply -f crds/
# 3. Install the stack RGD
kubectl apply -f rgd-rootsync-investigation-stack.yaml
# 4. Wait for kro to register the generated CRD. kubectl wait fails
#    immediately on a missing object, so poll for existence first.
for _ in $(seq 1 30); do
  kubectl get crd rootsyncinvestigationstacks.kro.run >/dev/null 2>&1 && break
  sleep 2
done
kubectl wait --for=condition=Established \
  crd/rootsyncinvestigationstacks.kro.run --timeout=60s

#!/usr/bin/env bash
# scripts/instance.sh <env>: apply an environment-specific instance
set -euo pipefail
ENV="${1:?usage: $0 <dev|staging|prod>}"
kubectl apply -f "instances/${ENV}.yaml"

Two scripts cover the full install path. The skill references them rather than restating the kubectl invocations.

Acceptance test

From the working directory:

./scripts/install.sh
kubectl get crd rootsyncinvestigationstacks.kro.run
# expected: the CRD is listed

./scripts/instance.sh prod   # applies instances/prod.yaml (the "production" instance)
kubectl get rootsyncinvestigationstack production -n kube-system -o yaml
# expected: .status.controllerReady turns "True" once the controller
# image is pullable; until then it is not "True"

kubectl get serviceaccounts,deployments,rootsyncinvestigations -n investigations-prod
kubectl get skills   # Skill is cluster-scoped
# expected: SA, Deployment, and default Skill present, all reconciled
# by kro. No investigations yet (none triggered).

If the controller image doesn't exist (it doesn't yet; that comes in the next chapters), the Deployment's pod will land in ImagePullBackOff and .status.controllerReady will not report "True". That's expected for this lesson: we packaged the stack; we haven't written the controller. The kro layer is done.

Coming up

The next chapters write the Go controller that the RGD references. Scaffold with kubebuilder or operator-sdk (your choice), use the pkg/llm client from L08–10, load Skill.spec.body from the cluster on each reconcile, run the AI loop within the RBAC the RGD already provisioned, and write findings back to the Investigation CR. By the time we wire it up, the install is solved and you can iterate on the controller image while leaving the user-facing install contract — one RootSyncInvestigationStack instance — completely unchanged.