Lesson 15 of 28

Module 4 · Skill — Codify `terraform-kind-provision`

doc

Checking sign-in…

Why this becomes a skill

You wrote Terraform config that provisions a kind cluster, exposed a kubeconfig as an output file, and — crucially — watched what happens when state and reality drift. That drift is the single most dangerous class of Terraform failure in real work, and you just saw it first-hand. A skill that encodes "do this cleanly and recover when state lies" is worth far more than one that only handles the happy path.

The skill inherits your understanding of why Terraform plans look the way they do, not just what to type. That's the difference between a skill that works and a skill that breaks the first time someone runs it against a partially-provisioned cluster.

Codify

Send Claude:

"Codify the session we just had as a skill at .claude/skills/terraform-kind-provision/SKILL.md. Scope: provision a single local kind cluster with Terraform, output a kubeconfig file. Parameterise: cluster name, node image, kind control-plane port mappings (list of {container_port, host_port} pairs), output directory for kubeconfig. Steps: scaffold versions.tf with tehcyx/kind and hashicorp/local providers, variables.tf, main.tf, outputs.tf, run terraform init → plan → apply -auto-approve, then kubectl --kubeconfig ./kubeconfig get nodes acceptance. Keep under 140 lines."

Read the SKILL.md. If the required provider versions are pinned too loosely (latest) or too tightly (= 0.4.1), ask Claude to explain the choice and adjust. Pinning is a trade-off — loose pins break over time, tight pins break during routine maintenance.

Refine

"Add a state-sanity step after terraform apply: parse terraform state list, verify both kind_cluster.main and local_file.kubeconfig exist, and kubectl --kubeconfig <outdir>/kubeconfig get nodes returns at least one Ready node. Fail loudly with a diagnostic if any check fails."
"Remember the Break it on purpose probe: we deleted the cluster out-of-band and Terraform's state kept claiming it existed. Add a Known failures section with: (a) state drift when a cluster is deleted externally — surface the fix: terraform apply will re-create (fine) but if the user wants to remove stale state, terraform state rm kind_cluster.main. (b) name collision — kind get clusters already lists the name, fix: either rename via variable or kind delete cluster --name <old> first. (c) provider version mismatch on state re-import — fix: terraform init -upgrade + carefully re-plan."
"Add a pre-apply terraform plan -out=tfplan step and show the output to the user before applying. This enforces the 'read the plan before apply' habit the module taught."

Validate in a fresh context — happy path AND adversarial

4a — Happy path

New Claude session:

"Read .claude/skills/terraform-kind-provision/SKILL.md and provision a cluster named validate-happy with node image kindest/node:v1.30.0, one port mapping (container_port 30888, host_port 8888). Output kubeconfig to ./kubeconfig-happy. Confirm kubectl --kubeconfig ./kubeconfig-happy get nodes shows one Ready node."

If the skill one-shots it: pass. Tear down: terraform destroy -auto-approve.

4b — Adversarial

Existing cluster collision: before running the skill, manually kind create cluster --name validate-happy. Ask the skill to provision a cluster with the same name. Does it pre-check with kind get clusters and fail with "cluster already exists — delete it or rename", or does it proceed and produce a confusing Terraform error partway through?
State drift: after a successful apply, manually kind delete cluster --name validate-happy. Ask the skill to verify the cluster is healthy. Does it run the state-sanity step (section 3 refinement) and notice the node list is empty, or does it trust state blindly?
Missing provider: run the skill without network access (pull the ethernet cable or set HTTP proxy to localhost:1). terraform init will fail to download the tehcyx/kind provider. Does the skill surface this with a clear "provider download failed; check network / mirror configuration" message, or does it emit Terraform's raw multi-hundred-line error?

Promote deterministic commands to `scripts/`

The init → plan → apply sequence is deterministic. The sanity check is deterministic. Both belong in a script:

"Extract to .claude/skills/terraform-kind-provision/scripts/provision.sh (args: cluster_name, node_image, outdir). Run: terraform -chdir=<workspace> init, terraform -chdir=<workspace> plan -var cluster_name=... -var node_image=... -out=tfplan, show the plan, require user confirmation unless --yes is passed, terraform -chdir=<workspace> apply tfplan, then the state-sanity block. set -euo pipefail. Exit non-zero with a distinct code per failure class (1 = plan failed, 2 = apply failed, 3 = sanity check failed)."

Distinct exit codes per class are small cost / big payoff for a skill that gets called from CI.

Know the boundary

At the top of .claude/skills/terraform-kind-provision/SKILL.md:

This skill handles: local kind-based Kubernetes provisioning via Terraform, kubeconfig extraction to a local file, pre-apply plan display, post-apply state sanity, and the state-drift recovery pattern.
This skill does NOT handle: cloud provisioning (GKE/EKS/AKS — these need cloud credentials, IAM roles, VPC networking, and should use a different skill; see module 7 for IAM), remote state backends (S3/GCS with DynamoDB/GCS locking), multi-environment workspaces, module composition, import of existing resources, destroy-protect guardrails, or production-grade credential handling (no credential file is written to disk except the kubeconfig).

You're done when

A fresh Claude session provisions a cluster from one prompt, your adversarial probes (name collision, state drift, missing provider) each produce clear actionable messages, and the distinct exit codes from provision.sh correctly signal the failure class.