Lesson 18 of 28
Module 5 · Task — Instrument your service and build a Grafana dashboard (via Claude)
The task
Drive Claude to add Prometheus metrics to the Go service from module 3, install kube-prometheus-stack in your kind cluster, create a ServiceMonitor so Prometheus scrapes your service, and build a Grafana panel showing request rate broken down by HTTP status.
Acceptance test: in Grafana, a panel with the query `sum by (code) (rate(hello_devops_http_requests_total[1m]))` displays a non-zero rate after you curl the service ~20 times. Traffic spikes correlate with the graph within ~30 seconds.
Setup
- Module 3 completed — you have the `hello-devops` Go service and the `chart/` Helm chart in your repo.
- A `kind` cluster running (from module 1 or 4).
- `kubectl` and `helm` installed.
Drive it through Claude
1. Instrument the Go app. Send Claude: "In my `hello-devops` repo, rewrite `main.go` to use `github.com/prometheus/client_golang`. Add a CounterVec `hello_devops_http_requests_total` labelled by `path` and `code`. Wrap each handler with an `instrument(path, h)` helper that increments the counter on completion, using a `statusRecorder` wrapper to capture the status code. Expose the metrics on `/metrics` via `promhttp.Handler()`. Keep the existing `/` and `/healthz` handlers."
2. Read the new `main.go` end-to-end. Ask Claude: why do I need a `statusRecorder` wrapper rather than just reading the status after the handler returns? If the answer isn't obvious in the code, the code isn't ready to ship.
3. Update `go.mod` + test. Send: "Run `go get github.com/prometheus/client_golang@latest` and `go mod tidy`. Then run the app locally (`go run .`), curl `/`, `/healthz`, and `/metrics`, and show me the `/metrics` output filtered to the `hello_devops_*` lines." Confirm you see `hello_devops_http_requests_total{code="200",path="/"} 1` (or similar) in the output.
4. Commit + push. Send: "Commit and push. Tell me which image tag the module-3 pipeline will produce so I can reference it in the next step."
5. Install kube-prometheus-stack. Send: "Add the `prometheus-community` Helm repo, create a `monitoring` namespace, and install `kube-prometheus-stack` as release `kps` with `--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false` and `--wait --timeout=10m`. Then show me the pods in `monitoring` when it finishes."
6. Ask Claude: what does `serviceMonitorSelectorNilUsesHelmValues=false` actually do, and what security / RBAC trade-off am I making by setting it? (The honest answer: it tells Prometheus to scrape any ServiceMonitor in the cluster regardless of which Helm release owns it — convenient for a demo, too permissive for shared clusters where teams expect scrape isolation.)
7. Wire the ServiceMonitor. Send: "In my `chart/templates/service.yaml`, make sure the single port is named `http`. Create a new file `chart/templates/servicemonitor.yaml` that defines a ServiceMonitor (apiVersion `monitoring.coreos.com/v1`) using the chart's `selectorLabels`, scraping endpoint port `http`, path `/metrics`, interval 15s, gated behind `{{ if .Values.serviceMonitor.enabled }}`. Add `serviceMonitor.enabled: true` to `values.yaml`. Then `helm upgrade hello ./chart -n demo --reuse-values --wait`."
8. Read `servicemonitor.yaml`. Ask: what does the selector have to match for Prometheus to find this service? What happens if my chart's labels change later?
9. Confirm scraping. Send: "Port-forward the Prometheus service to localhost:9090 and tell me the URL for the targets page. Then query `hello_devops_http_requests_total` in the UI and confirm I see series."
10. Open Grafana + build the panel. Send: "Grab the Grafana admin password from the `kps-grafana` secret in `monitoring`. Port-forward Grafana to localhost:3000. Walk me through creating a new dashboard with a single panel running `sum by (code) (rate(hello_devops_http_requests_total[1m]))`, time range 15m, auto-refresh 5s."
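For reference, the ServiceMonitor template described above might come out looking roughly like this. It's a sketch, not a required answer: the `hello-devops.fullname`, `hello-devops.labels`, and `hello-devops.selectorLabels` helper names are assumptions about your chart's `_helpers.tpl`.

```yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "hello-devops.fullname" . }}
  labels:
    {{- include "hello-devops.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "hello-devops.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http        # must match the *name* of the Service port, not the number
      path: /metrics
      interval: 15s
{{- end }}
```

The selector has to match your Service's labels exactly; that coupling is the thing you'll break on purpose below.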
A note on identity — what we just granted Prometheus
When kube-prometheus-stack installed, it created a ClusterRole and a ClusterRoleBinding that grant the Prometheus ServiceAccount cluster-wide read access to services, endpoints, pods, and configmaps across all namespaces. That's what lets Prometheus discover anything with a ServiceMonitor. The `--set serviceMonitorSelectorNilUsesHelmValues=false` flag then widens the selection to all ServiceMonitors, not just ones labelled for this release.
In a shared cluster, you'd do this differently: scope which ServiceMonitors Prometheus selects with a label selector, or give each team their own Prometheus instance scoped to their namespaces. The flag you set is fine for a demo; in production it's a meeting with the platform team. Module 7 will walk through scoping this properly.
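For a concrete picture of the label-selector option, here is a sketch of the relevant kube-prometheus-stack values; the `team: payments` label is an invented example, not something this course's cluster uses.

```yaml
# kube-prometheus-stack values: only select ServiceMonitors that carry
# an explicit team label, instead of every ServiceMonitor in the cluster.
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: true   # the default: honour the selector below
    serviceMonitorSelector:
      matchLabels:
        team: payments   # invented example; teams add this label to their ServiceMonitors
```

With this in place, a ServiceMonitor without the label simply never appears on the targets page, which is exactly the scrape-isolation behaviour shared clusters want.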
Break it on purpose
Observability stacks fail silently by default. See one failure mode now.
- Break the ServiceMonitor selector — change `selector.matchLabels` in `servicemonitor.yaml` to something that doesn't match your Service (e.g., add `typo: yes`). `helm upgrade` it in.
- Predict: what does http://localhost:9090/targets show now? What does the Grafana panel show over the next 2 minutes?
- Observe. Curl the service some more. Notice that metrics still increment inside the Pod (the counter is in-process), but Prometheus has no way to see them — and nothing in Grafana makes this obvious except that your dashboard stops moving.
- Revert the selector change and `helm upgrade` again. Confirm data returns.
The class of failure: scrape gaps are invisible unless you alert on them. A broken ServiceMonitor is indistinguishable from "no traffic" on the dashboard. Your skill will need to document this and explain how to detect it (`up{job=...}` in Prometheus, or an alert on `absent(hello_devops_http_requests_total)`).
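The `absent()` detection can live next to the ServiceMonitor as a PrometheusRule. A sketch under assumptions: the rule and alert names, the 5m window, and the severity are arbitrary choices, and depending on how your kube-prometheus-stack release selects rules you may need a matching release label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-devops-scrape-gap
  labels:
    release: kps   # may be required if rule selection is scoped to the release
spec:
  groups:
    - name: hello-devops
      rules:
        - alert: HelloDevopsMetricsAbsent
          # Fires when the series vanishes entirely: a scrape gap or a dead service,
          # which "no traffic" on a dashboard would silently hide.
          expr: absent(hello_devops_http_requests_total)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: hello-devops metrics have disappeared; check the targets page
```

This is the difference the next lesson's skill has to capture: the dashboard shows traffic, but only an alert like this shows silence.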
Acceptance test
Generate traffic:
```shell
kubectl -n demo port-forward svc/hello-chart 9898:9898 &
for i in {1..20}; do curl -s http://localhost:9898/ > /dev/null; done
for i in {1..20}; do curl -s http://localhost:9898/healthz > /dev/null; done
```
Return to Grafana. Within ~30 seconds you should see the rate climb on the `code="200"` series. Curl `http://localhost:9898/nope` a few times to get a `code="404"` series.
What to keep for the next lesson
Keep the chart changes, the Grafana dashboard (export it as JSON — Dashboard settings → JSON Model → copy), and your "Break it on purpose" notes on the scrape gap. In the next lesson you'll codify `.claude/skills/prom-grafana-instrument/` and teach it that a working install isn't a working observability setup — a working setup also detects its own silence.