Lesson 19 of 28
Module 5 · Skill — Codify `prom-grafana-instrument`
Why this becomes a skill
You instrumented a Go service, installed kube-prometheus-stack, wired a ServiceMonitor, and built a Grafana panel. You also broke the ServiceMonitor selector and watched the dashboard go silently blank — the single most common and most dangerous failure mode in observability. A skill that captures "install the stack, wire the scrape, and verify the scrape is actually working" is worth more than one that only handles the install.
The skill must inherit your knowledge that a working install is not a working observability setup. A working setup proves it is scraping, and alerts you when it stops.
Codify
Send Claude:
"Codify the session we just had as a skill at
.claude/skills/prom-grafana-instrument/SKILL.md. Scope: add Prometheus metrics to a Go HTTP service (CounterVec labelled by path + status code,/metricsendpoint via promhttp), installkube-prometheus-stackintomonitoringnamespace, add aServiceMonitorto the service's Helm chart, build a RED-style Grafana panel. Parameterise: service name, namespace, metric prefix (default = service name snake-cased), scrape interval, chart path. Keep SKILL.md under 160 lines."
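For reference, the RED-style request-rate panel this prompt asks for typically reduces to a single PromQL expression. The metric name below is illustrative, following the prompt's default snake-cased-prefix rule:

```promql
sum by (code) (rate(myservice_http_requests_total[5m]))
```

A useful rule of thumb: keep the range window at least 4x the scrape interval so `rate()` always sees two samples per series.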
Read the SKILL.md. In particular, confirm it spells out the ServiceMonitor selector-matching rule explicitly — the match between the monitor's `selector.matchLabels` and the Service's labels is the thing that silently breaks.
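For concreteness, a minimal sketch of that rule in manifest form; the names, port, and interval are illustrative, not taken from the session:

```yaml
# Service rendered by the chart: its labels are the contract.
apiVersion: v1
kind: Service
metadata:
  name: todos-api
  labels:
    app.kubernetes.io/name: todos-api    # must reappear below, verbatim
spec:
  ports:
    - name: http
      port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: todos-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: todos-api  # must equal the Service's labels exactly
  endpoints:
    - port: http                         # matches the Service port *name*, not the number
      path: /metrics
      interval: 30s
```

If `matchLabels` names a label the Service does not carry, Prometheus simply drops the target: no error, no event, just a missing series.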
Refine
"Add a
Verify scrapingstep usingkubectl port-forwardto Prometheus and a check thatup{job=<service>}returns 1. Fail if the target is DOWN or doesn't exist. This is what detects a broken ServiceMonitor immediately rather than 15 minutes later in Grafana.""Remember the Break it on purpose probe: we broke
selector.matchLabelsto include a non-matching key, and Prometheus silently stopped scraping — no error, no alert, just a flat line on the dashboard. Add aKnown failuresentry: scrape-gap — user's dashboard is blank or flat. Diagnostic:kubectl get servicemonitor -A(does the ServiceMonitor exist?),kubectl get endpoints <service> -n <ns>(does the Service have endpoints?), Prometheus UI /targets (is the job UP?). Each narrows the cause by one step.""Add a default PrometheusRule in the Helm chart that alerts on
absent(over_time(<metric>_total[5m]))with severitywarning. That's how you detect scrape gaps without staring at dashboards."
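A hedged sketch of that rule as a `PrometheusRule` manifest. The metric name and the `release` label are assumptions; the label is how the stack's default `ruleSelector` discovers rules unless you have relaxed it:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: todos-api-scrape-gap
  labels:
    release: kube-prometheus-stack       # assumed release name; match yours
spec:
  groups:
    - name: todos-api.scrape
      rules:
        - alert: ScrapeGap
          # Fires when the counter has produced no samples for 5 minutes,
          # i.e. the classic silent ServiceMonitor failure.
          expr: absent_over_time(todos_api_http_requests_total[5m])
          labels:
            severity: warning
          annotations:
            summary: "todos-api has exported no samples for 5m; check the ServiceMonitor selector"
```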
Validate in a fresh context — happy path AND adversarial
4a — Happy path
New Claude session:
"Read
.claude/skills/prom-grafana-instrument/SKILL.mdand instrument a Python FastAPI service calledtodos-apiwith request-rate-by-code metrics. Service is already deployed via Helm at chart path./chart. Installkube-prometheus-stackif not present. Add a ServiceMonitor. Build a Grafana panel. Confirm scraping is UP."
Adapting the skill from Go to Python is non-trivial (different client library, different wrapping pattern). If the skill hard-codes the Go instrumentation, the template is under-parameterised — refine it.
4b — Adversarial
- Selector drift: after scraping is confirmed working, change the Helm chart's `selectorLabels` in `templates/_helpers.tpl` (so the Service's labels change), then `helm upgrade`. Ask the skill to verify scraping. Does the Verify scraping step catch that `up` is no longer 1 for this job, or does it silently say "install complete"?
- Wrong metric name: feed the skill a target service whose `/metrics` endpoint uses a different naming convention (e.g., `myapp_requests` instead of `myapp_http_requests_total`). Does the Grafana panel query work (it shouldn't — wrong metric name), and does the skill catch the mismatch before building the dashboard?
- `kube-prometheus-stack` already installed in a different namespace: another team has already installed it as release `platform-prom` in namespace `platform`. Does the skill notice and reuse it, or does it try to install a second copy into `monitoring` and produce a cluster-wide CRD collision?
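The third scenario is cheap to guard against in the skill itself. A sketch, assuming `jq` is available; `find_kps_release` and the sample usage are hypothetical, and the function reads the output of `helm list -A -o json` from stdin so it can be exercised without a cluster:

```shell
#!/usr/bin/env bash
# Print "name namespace" for any kube-prometheus-stack release in the cluster.
# Reads the JSON that `helm list -A -o json` emits on stdin.
find_kps_release() {
  jq -r '.[]
         | select(.chart | startswith("kube-prometheus-stack"))
         | "\(.name) \(.namespace)"'
}

# Intended use inside the skill (not run here):
#   existing=$(helm list -A -o json | find_kps_release)
#   [ -n "$existing" ] && echo "reusing: $existing" || echo "safe to install"
```

If the function prints anything, the skill should reuse that release rather than install a second copy and risk the CRD collision.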
Promote deterministic commands to scripts/
The repo add, the install, the verify-scrape block, and the alert-rule apply are all deterministic:
"Extract to
.claude/skills/prom-grafana-instrument/scripts/install-kps.sh(no args — idempotent) and.claude/skills/prom-grafana-instrument/scripts/verify-scrape.sh(args: service name, namespace). The verify-scrape script port-forwards Prometheus, queriesup{job=<service>}=1via HTTP, and exits non-zero if the job is DOWN or missing.set -euo pipefail."
`install-kps.sh` should `helm repo add` conditionally (don't fail if the repo is already added), `helm upgrade --install` with `--reuse-values` so re-runs don't clobber user overrides, and `kubectl wait --for=condition=available` on the operator deployment.
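A minimal sketch of both scripts, written out here so the shape is reviewable. The release name, namespace, Prometheus service name, and operator label are assumptions that track kube-prometheus-stack defaults; verify them against your install:

```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p scripts

# install-kps.sh: no args, safe to re-run.
cat > scripts/install-kps.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Tolerate an already-added repo; never clobber user-supplied values.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 2>/dev/null || true
helm repo update prometheus-community
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace --reuse-values --wait
# Operator label assumed from chart defaults; adjust if your chart version differs.
kubectl -n monitoring wait --for=condition=available --timeout=300s \
  deployment -l app=kube-prometheus-stack-operator
EOF

# verify-scrape.sh: args are service name and namespace.
cat > scripts/verify-scrape.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
service="$1"; namespace="$2"
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 >/dev/null &
pf=$!
trap 'kill "$pf" 2>/dev/null' EXIT
sleep 3   # crude; a retry loop on the query below is more robust
result=$(curl -fsS http://localhost:9090/api/v1/query \
  --data-urlencode "query=up{job=\"${service}\",namespace=\"${namespace}\"}")
# Empty result vector => no such target; value other than "1" => target DOWN.
echo "$result" | grep -q '"value":\[' || { echo "FAIL: no target for job=${service}" >&2; exit 1; }
echo "$result" | grep -q '"1"\]'      || { echo "FAIL: job=${service} is DOWN" >&2; exit 1; }
echo "OK: job=${service} is UP"
EOF

chmod +x scripts/install-kps.sh scripts/verify-scrape.sh
```

Note that verify-scrape fails loudly in both branches (target missing, target DOWN), which is exactly the behaviour the selector-drift probe checks for.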
Know the boundary
At the top of `.claude/skills/prom-grafana-instrument/SKILL.md`:
- This skill handles: HTTP-service metrics (counters / histograms on `{path, code}`), `kube-prometheus-stack` install or reuse, `ServiceMonitor` creation with selector hygiene, a RED-method Grafana panel, and a scrape-gap alerting rule.
- This skill does NOT handle: logs (no Loki / EFK wiring) or traces (no OpenTelemetry collector) — metrics only. No long-term storage (Thanos/Mimir). No multi-tenant scrape isolation (all ServiceMonitors are discovered cluster-wide with `serviceMonitorSelectorNilUsesHelmValues=false` — tighten this for shared clusters; see module 7 for RBAC-scoped scrape boundaries). No alert routing (Alertmanager is installed but not configured with receivers — wire Slack / PagerDuty yourself). No dashboard-as-code beyond a single panel — use Grafonnet or grr for a real dashboard pipeline.
You're done when
A fresh Claude session instruments a different-language service, installs or reuses the stack, confirms up=1, and produces an accurate RED-method dashboard — AND when you feed it the adversarial scenarios (selector drift, metric mismatch, pre-existing stack in another namespace), it either recovers correctly or fails loudly with an actionable message.