Lesson 19 of 28
Module 5 · Skill — Codify `prom-grafana-instrument`
Why this becomes a skill
You instrumented a Go service, installed kube-prometheus-stack, wired a ServiceMonitor, and built a Grafana panel. You also broke the ServiceMonitor selector and watched the dashboard go silently blank — the single most common and most dangerous failure mode in observability. A skill that captures "install the stack, wire the scrape, and verify the scrape is actually working" is worth more than one that only handles the install.
The skill must inherit your knowledge that a working install is not a working observability setup. A working setup proves it is scraping, and alerts you when it stops.
Codify
Send Claude:
"Codify the session we just had as a skill at
.claude/skills/prom-grafana-instrument/SKILL.md. Scope: add Prometheus metrics to a Go HTTP service (CounterVec labelled by path + status code,/metricsendpoint via promhttp), installkube-prometheus-stackintomonitoringnamespace, add aServiceMonitorto the service's Helm chart, build a RED-style Grafana panel. Parameterise: service name, namespace, metric prefix (default = service name snake-cased), scrape interval, chart path. Keep SKILL.md under 160 lines."
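For reference, the RED-style request-rate panel this prompt asks for typically reduces to a single PromQL expression. The metric name below is illustrative, following the prompt's default snake-cased-prefix rule:

```promql
sum by (code) (rate(myservice_http_requests_total[5m]))
```

A useful rule of thumb: keep the range window at least 4x the scrape interval so `rate()` always sees two samples per series.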
Read the SKILL.md. In particular, confirm it spells out the ServiceMonitor selector-matching rule explicitly — the match between the monitor's `selector.matchLabels` and the Service's labels is the thing that silently breaks.
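For concreteness, a minimal sketch of that rule in manifest form; the names, port, and interval are illustrative, not taken from the session:

```yaml
# Service rendered by the chart: its labels are the contract.
apiVersion: v1
kind: Service
metadata:
  name: todos-api
  labels:
    app.kubernetes.io/name: todos-api    # must reappear below, verbatim
spec:
  ports:
    - name: http
      port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: todos-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: todos-api  # must equal the Service's labels exactly
  endpoints:
    - port: http                         # matches the Service port *name*, not the number
      path: /metrics
      interval: 30s
```

If `matchLabels` names a label the Service does not carry, Prometheus simply drops the target: no error, no event, just a missing series.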
Refine
"Add a
Verify scrapingstep usingkubectl port-forwardto Prometheus and a check thatup{job=<service>}returns 1. Fail if the target is DOWN or doesn't exist. This is what detects a broken ServiceMonitor immediately rather than 15 minutes later in Grafana.""Remember the Break it on purpose probe: we broke
selector.matchLabelsto include a non-matching key, and Prometheus silently stopped scraping — no error, no alert, just a flat line on the dashboard. Add aKnown failuresentry: scrape-gap — user's dashboard is blank or flat. Diagnostic:kubectl get servicemonitor -A(does the ServiceMonitor exist?),kubectl get endpoints <service> -n <ns>(does the Service have endpoints?), Prometheus UI /targets (is the job UP?). Each narrows the cause by one step.""Add a default PrometheusRule in the Helm chart that alerts on
absent(over_time(<metric>_total[5m]))with severitywarning. That's how you detect scrape gaps without staring at dashboards."
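A hedged sketch of that rule as a `PrometheusRule` manifest. The metric name and the `release` label are assumptions; the label is how the stack's default `ruleSelector` discovers rules unless you have relaxed it:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: todos-api-scrape-gap
  labels:
    release: kube-prometheus-stack       # assumed release name; match yours
spec:
  groups:
    - name: todos-api.scrape
      rules:
        - alert: ScrapeGap
          # Fires when the counter has produced no samples for 5 minutes,
          # i.e. the classic silent ServiceMonitor failure.
          expr: absent_over_time(todos_api_http_requests_total[5m])
          labels:
            severity: warning
          annotations:
            summary: "todos-api has exported no samples for 5m; check the ServiceMonitor selector"
```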
Validate in a fresh context — happy path AND adversarial
4a — Happy path
New Claude session:
"Read
.claude/skills/prom-grafana-instrument/SKILL.mdand instrument a Python FastAPI service calledtodos-apiwith request-rate-by-code metrics. Service is already deployed via Helm at chart path./chart. Installkube-prometheus-stackif not present. Add a ServiceMonitor. Build a Grafana panel. Confirm scraping is UP."
Adapting the skill from Go to Python is non-trivial (different client library, different wrapping pattern). If the skill hard-codes the Go instrumentation, the template is under-parameterised — refine it.
4b — Adversarial
- Selector drift: after scraping is confirmed working, change the Helm chart's `selectorLabels` in `templates/_helpers.tpl` (so the Service's labels change), then `helm upgrade`. Ask the skill to verify scraping. Does the Verify scraping step catch that `up` is no longer 1 for this job, or does it silently say "install complete"?
- Wrong metric name: feed the skill a target service whose `/metrics` endpoint uses a different naming convention (e.g., `myapp_requests` instead of `myapp_http_requests_total`). Does the Grafana panel query work (it shouldn't — wrong metric name), and does the skill catch the mismatch before building the dashboard?
- `kube-prometheus-stack` already installed in a different namespace: another team has already installed it as release `platform-prom` in namespace `platform`. Does the skill notice and reuse it, or does it try to install a second copy into `monitoring` and produce a cluster-wide CRD collision?
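The third scenario is cheap to guard against in the skill itself. A sketch, assuming `jq` is available; `find_kps_release` and the sample usage are hypothetical, and the function reads the output of `helm list -A -o json` from stdin so it can be exercised without a cluster:

```shell
#!/usr/bin/env bash
# Print "name namespace" for any kube-prometheus-stack release in the cluster.
# Reads the JSON that `helm list -A -o json` emits on stdin.
find_kps_release() {
  jq -r '.[]
         | select(.chart | startswith("kube-prometheus-stack"))
         | "\(.name) \(.namespace)"'
}

# Intended use inside the skill (not run here):
#   existing=$(helm list -A -o json | find_kps_release)
#   [ -n "$existing" ] && echo "reusing: $existing" || echo "safe to install"
```

If the function prints anything, the skill should reuse that release rather than install a second copy and risk the CRD collision.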
Promote deterministic commands to scripts/
The repo add, the install, the verify-scrape block, and the alert-rule apply are all deterministic:
"Extract to
.claude/skills/prom-grafana-instrument/scripts/install-kps.sh(no args — idempotent) and.claude/skills/prom-grafana-instrument/scripts/verify-scrape.sh(args: service name, namespace). The verify-scrape script port-forwards Prometheus, queriesup{job=<service>}=1via HTTP, and exits non-zero if the job is DOWN or missing.set -euo pipefail."
`install-kps.sh` should `helm repo add` conditionally (don't fail if the repo is already added), `helm upgrade --install` with `--reuse-values` so re-runs don't clobber user overrides, and `kubectl wait --for=condition=available` on the operator deployment.
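A minimal sketch of both scripts, written out here so the shape is reviewable. The release name, namespace, Prometheus service name, and operator label are assumptions that track kube-prometheus-stack defaults; verify them against your install:

```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p scripts

# install-kps.sh: no args, safe to re-run.
cat > scripts/install-kps.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Tolerate an already-added repo; never clobber user-supplied values.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 2>/dev/null || true
helm repo update prometheus-community
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace --reuse-values --wait
# Operator label assumed from chart defaults; adjust if your chart version differs.
kubectl -n monitoring wait --for=condition=available --timeout=300s \
  deployment -l app=kube-prometheus-stack-operator
EOF

# verify-scrape.sh: args are service name and namespace.
cat > scripts/verify-scrape.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
service="$1"; namespace="$2"
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 >/dev/null &
pf=$!
trap 'kill "$pf" 2>/dev/null' EXIT
sleep 3   # crude; a retry loop on the query below is more robust
result=$(curl -fsS http://localhost:9090/api/v1/query \
  --data-urlencode "query=up{job=\"${service}\",namespace=\"${namespace}\"}")
# Empty result vector => no such target; value other than "1" => target DOWN.
echo "$result" | grep -q '"value":\[' || { echo "FAIL: no target for job=${service}" >&2; exit 1; }
echo "$result" | grep -q '"1"\]'      || { echo "FAIL: job=${service} is DOWN" >&2; exit 1; }
echo "OK: job=${service} is UP"
EOF

chmod +x scripts/install-kps.sh scripts/verify-scrape.sh
```

Note that verify-scrape fails loudly in both branches (target missing, target DOWN), which is exactly the behaviour the selector-drift probe checks for.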
Know the boundary
At the top of `.claude/skills/prom-grafana-instrument/SKILL.md`:
- This skill handles: HTTP-service metrics (counters / histograms on `{path, code}`), `kube-prometheus-stack` install or reuse, `ServiceMonitor` creation with selector hygiene, a RED-method Grafana panel, and a scrape-gap alerting rule.
- This skill does NOT handle: logs (no Loki / EFK wiring) or traces (no OpenTelemetry collector) — metrics only. No long-term storage (Thanos/Mimir). No multi-tenant scrape isolation (all ServiceMonitors are discovered cluster-wide with `serviceMonitorSelectorNilUsesHelmValues=false` — tighten this for shared clusters; see module 7 for RBAC-scoped scrape boundaries). No alert routing (Alertmanager is installed but not configured with receivers — wire Slack / PagerDuty yourself). No dashboard-as-code beyond a single panel — use Grafonnet or grr for a real dashboard pipeline.
You're done when
A fresh Claude session instruments a different-language service, installs or reuses the stack, confirms up=1, and produces an accurate RED-method dashboard — AND when you feed it the adversarial scenarios (selector drift, metric mismatch, pre-existing stack in another namespace), it either recovers correctly or fails loudly with an actionable message.