Lesson 18 of 28
Module 5 · Task — Instrument your service and build a Grafana dashboard (via Claude)
The task
Drive Claude to add Prometheus metrics to the Go service from module 3, install kube-prometheus-stack in your kind cluster, create a ServiceMonitor so Prometheus scrapes your service, and build a Grafana panel showing request rate broken down by HTTP status.
Acceptance test: in Grafana, a panel with the query `sum by (code) (rate(hello_devops_http_requests_total[1m]))` displays a non-zero rate after you curl the service ~20 times. Traffic spikes correlate with the graph within ~30 seconds.
Setup
- Module 3 completed — you have the `hello-devops` Go service and the `chart/` Helm chart in your repo.
- A `kind` cluster running (from module 1 or 4).
- `kubectl` and `helm` installed.
Drive it through Claude
1. Instrument the Go app. Send Claude: "In my `hello-devops` repo, rewrite `main.go` to use `github.com/prometheus/client_golang`. Add a CounterVec `hello_devops_http_requests_total` labelled by `path` and `code`. Wrap each handler with an `instrument(path, h)` helper that increments the counter on completion, using a `statusRecorder` wrapper to capture the status code. Expose the metrics on `/metrics` via `promhttp.Handler()`. Keep the existing `/` and `/healthz` handlers."
2. Read the new `main.go` end-to-end. Ask Claude: why do I need a `statusRecorder` wrapper rather than just reading the status after the handler returns? If the answer isn't obvious in the code, the code isn't ready to ship.
3. Update `go.mod` + test. Send: "Run `go get github.com/prometheus/client_golang@latest` and `go mod tidy`. Then run the app locally (`go run .`), curl `/`, `/healthz`, and `/metrics`, and show me the `/metrics` output filtered to the `hello_devops_*` lines." Confirm you see `hello_devops_http_requests_total{code="200",path="/"} 1` (or similar) in the output.
4. Commit + push. Send: "Commit and push. Tell me which image tag the module-3 pipeline will produce so I can reference it in the next step."
5. Install kube-prometheus-stack. Send: "Add the `prometheus-community` Helm repo, create a `monitoring` namespace, and install `kube-prometheus-stack` as release `kps` with `--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false` and `--wait --timeout=10m`. Then show me the pods in `monitoring` when it finishes."
6. Ask Claude: what does `serviceMonitorSelectorNilUsesHelmValues=false` actually do, and what security / RBAC trade-off am I making by setting it? (The honest answer: it tells Prometheus to scrape any ServiceMonitor in the cluster regardless of which Helm release owns it — convenient for a demo, too permissive for shared clusters where teams expect scrape isolation.)
7. Wire the ServiceMonitor. Send: "In my `chart/templates/service.yaml`, make sure the single port is named `http`. Create a new file `chart/templates/servicemonitor.yaml` that defines a ServiceMonitor (apiVersion `monitoring.coreos.com/v1`) using the chart's `selectorLabels`, scraping endpoint port `http`, path `/metrics`, interval 15s, gated behind `{{ if .Values.serviceMonitor.enabled }}`. Add `serviceMonitor.enabled: true` to `values.yaml`. Then `helm upgrade hello ./chart -n demo --reuse-values --wait`."
8. Read `servicemonitor.yaml`. Ask: what does the selector have to match for Prometheus to find this service? What happens if my chart's labels change later?
9. Confirm scraping. Send: "Port-forward the Prometheus service to localhost:9090 and tell me the URL for the targets page. Then query `hello_devops_http_requests_total` in the UI and confirm I see series."
10. Open Grafana + build the panel. Send: "Grab the Grafana admin password from the `kps-grafana` secret in `monitoring`. Port-forward Grafana to localhost:3000. Walk me through creating a new dashboard with a single panel running `sum by (code) (rate(hello_devops_http_requests_total[1m]))`, time range 15m, auto-refresh 5s."
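For reference, the ServiceMonitor template described above might come out looking roughly like this. It's a sketch, not a required answer: the `hello-devops.fullname`, `hello-devops.labels`, and `hello-devops.selectorLabels` helper names are assumptions about your chart's `_helpers.tpl`.

```yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "hello-devops.fullname" . }}
  labels:
    {{- include "hello-devops.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "hello-devops.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http        # must match the *name* of the Service port, not the number
      path: /metrics
      interval: 15s
{{- end }}
```

The selector has to match your Service's labels exactly; that coupling is the thing you'll break on purpose below.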
A note on identity — what we just granted Prometheus
When kube-prometheus-stack installed, it created a ClusterRole and a ClusterRoleBinding that grant the Prometheus ServiceAccount cluster-wide read access to services, endpoints, pods, and configmaps across all namespaces. That's what lets Prometheus discover anything with a ServiceMonitor. The `--set serviceMonitorSelectorNilUsesHelmValues=false` flag then widens the selection to all ServiceMonitors, not just ones labelled for this release.
In a shared cluster, you'd do this differently: scope which ServiceMonitors Prometheus selects with a label selector, or give each team their own Prometheus instance scoped to their namespaces. The flag you set is fine for a demo; in production it's a meeting with the platform team. Module 7 will walk through scoping this properly.
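For a concrete picture of the label-selector option, here is a sketch of the relevant kube-prometheus-stack values; the `team: payments` label is an invented example, not something this course's cluster uses.

```yaml
# kube-prometheus-stack values: only select ServiceMonitors that carry
# an explicit team label, instead of every ServiceMonitor in the cluster.
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: true   # the default: honour the selector below
    serviceMonitorSelector:
      matchLabels:
        team: payments   # invented example; teams add this label to their ServiceMonitors
```

With this in place, a ServiceMonitor without the label simply never appears on the targets page, which is exactly the scrape-isolation behaviour shared clusters want.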
Break it on purpose
Observability stacks fail silently by default. See one failure mode now.
- Break the ServiceMonitor selector — change `selector.matchLabels` in `servicemonitor.yaml` to something that doesn't match your Service (e.g., add `typo: yes`). `helm upgrade` it in.
- Predict: what does http://localhost:9090/targets show now? What does the Grafana panel show over the next 2 minutes?
- Observe. Curl the service some more. Notice that metrics still increment inside the Pod (the counter is in-process), but Prometheus has no way to see them — and nothing in Grafana makes this obvious except that your dashboard stops moving.
- Revert the selector change and `helm upgrade` again. Confirm data returns.
The class of failure: scrape gaps are invisible unless you alert on them. A broken ServiceMonitor is indistinguishable from "no traffic" on the dashboard. Your skill will need to document this and explain how to detect it (`up{job=...}` in Prometheus, or an alert on `absent(hello_devops_http_requests_total)`).
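The `absent()` detection can live next to the ServiceMonitor as a PrometheusRule. A sketch under assumptions: the rule and alert names, the 5m window, and the severity are arbitrary choices, and depending on how your kube-prometheus-stack release selects rules you may need a matching release label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-devops-scrape-gap
  labels:
    release: kps   # may be required if rule selection is scoped to the release
spec:
  groups:
    - name: hello-devops
      rules:
        - alert: HelloDevopsMetricsAbsent
          # Fires when the series vanishes entirely: a scrape gap or a dead service,
          # which "no traffic" on a dashboard would silently hide.
          expr: absent(hello_devops_http_requests_total)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: hello-devops metrics have disappeared; check the targets page
```

This is the difference the next lesson's skill has to capture: the dashboard shows traffic, but only an alert like this shows silence.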
Acceptance test
Generate traffic:
```shell
kubectl -n demo port-forward svc/hello-chart 9898:9898 &
for i in {1..20}; do curl -s http://localhost:9898/ > /dev/null; done
for i in {1..20}; do curl -s http://localhost:9898/healthz > /dev/null; done
```
Return to Grafana. Within ~30 seconds you should see the rate climb on the `code="200"` series. Curl `http://localhost:9898/nope` a few times to get a `code="404"` series.
What to keep for the next lesson
Keep the chart changes, the Grafana dashboard (export it as JSON — Dashboard settings → JSON Model → copy), and your "Break it on purpose" notes on the scrape gap. In the next lesson you'll codify `.claude/skills/prom-grafana-instrument/` and teach it that a working install isn't a working observability setup — a working setup also detects its own silence.