Module 5 · Concepts — Metrics, logs, and the three questions observability answers
The three questions
Observability isn't a tool — it's the ability to answer, at 3am on a Saturday:
- Is the service up? (healthy/unhealthy — the simplest question)
- Is the service slow or erroring? (the RED method: Rate, Errors, Duration)
- Why? (drill down: which endpoint, which pod, which dependency?)
Tools exist to answer each question. As a DevOps/Platform engineer, you're expected to know the stack that answers them, at least at a surface level.
Metrics vs logs vs traces
- Metrics: numbers over time. Cheap to store, good for aggregation (request_rate_total, cpu_utilization), bad for the question "what exactly happened for user X at 14:37?"
- Logs: events with timestamps. Rich in detail, expensive to store, good for retracing a single request.
- Traces: a structured record of one request as it moves through many services. Essential in distributed systems; overkill for a single-service setup.
This module focuses on metrics. Logs you'll add in a real job with a stack like Loki or Elasticsearch; traces with OpenTelemetry + a backend like Jaeger or Tempo. The reasoning is the same for all three — you just pick different storage.
Prometheus in 90 seconds
Prometheus is a time-series database that pulls metrics from HTTP endpoints your services expose. It scrapes /metrics on every target at a fixed interval (e.g. every 15 seconds) and stores each sample.
A metrics endpoint response looks like:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/healthz",status="200"} 42
http_requests_total{method="GET",path="/",status="200"} 8
Each non-comment line is one time series: a metric name, its labels (method, path, status), and the current value. The labels are the powerful bit: they let you slice. For example, "errors per path" is sum by (path) (rate(http_requests_total{status=~"5.."}[5m])).
That query language is PromQL. You don't need to memorise it — bookmark the cheat sheet and copy patterns.
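To make the endpoint format from earlier in this section concrete, here is a minimal Go sketch that renders one sample line from a metric name, a label set, and a value. The helper name formatSample is hypothetical and uses only the standard library; a real service would normally let a Prometheus client library produce this output, and the sketch assumes simple ASCII label values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample renders one sample line in the Prometheus text format:
// name{label="value",...} value. Labels are sorted by key so the output
// is deterministic. (Hypothetical helper, for illustration only.)
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// %q wraps the value in double quotes; assumes simple ASCII values.
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	labels := map[string]string{"method": "GET", "path": "/healthz", "status": "200"}
	fmt.Println(formatSample("http_requests_total", labels, 42))
}
```

Each distinct combination of label values produces its own line, which is why a query can later group or filter by any label.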
Four metric types
- Counter — a number that only goes up (or resets to 0). Request counts, bytes processed.
- Gauge — a number that can go up and down. Queue depth, memory in use.
- Histogram — a distribution. Request latency is almost always a histogram.
- Summary — also a distribution, but calculated client-side. Less commonly used; prefer histograms.
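The first three types can be sketched in plain Go to show how they differ. This is a simplified illustration, not how a client library implements them: the type names, the choice to ignore negative counter deltas, and the bucket bounds are all assumptions made for the example. The histogram uses cumulative buckets, where each bucket counts every observation at or below its upper bound.

```go
package main

import "fmt"

// Counter only goes up; a process restart resets it to 0.
type Counter struct{ value float64 }

// Add ignores negative deltas so the counter never decreases
// (a simplification chosen for this sketch).
func (c *Counter) Add(delta float64) {
	if delta > 0 {
		c.value += delta
	}
}

// Gauge can move in both directions.
type Gauge struct{ value float64 }

func (g *Gauge) Set(v float64) { g.value = v }

// Histogram counts observations into cumulative buckets:
// counts[i] is the number of observations <= bounds[i].
type Histogram struct {
	bounds []float64 // upper bounds, ascending
	counts []uint64
	sum    float64
	total  uint64
}

func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one value in every bucket whose bound covers it.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.total++
}

func main() {
	h := NewHistogram(0.1, 0.5, 1) // hypothetical latency buckets in seconds
	for _, d := range []float64{0.05, 0.3, 0.7, 2} {
		h.Observe(d)
	}
	fmt.Println(h.counts, h.total) // [1 2 3] 4
}
```

Because the buckets are plain counters, they can be summed across pods before percentiles are estimated on the server, which is one reason histograms are usually preferred over summaries.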
The RED method says to instrument every service with three metrics:
- Rate: http_requests_total (counter)
- Errors: http_requests_total{status=~"5.."} (same counter, filtered)
- Duration: http_request_duration_seconds (histogram)
Learn RED, ship RED, and you're ahead of most services.
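One common way to collect the three RED signals is an HTTP middleware that wraps every handler. The sketch below uses only the Go standard library, and redStats, statusRecorder, and runDemo are hypothetical names; it keeps counts in an in-memory map and a running total of seconds, where a real service would use a client library's counter and histogram instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// redStats is a hypothetical in-memory stand-in for the RED metrics.
type redStats struct {
	mu       sync.Mutex
	requests map[string]int // keyed by "METHOD path status"
	seconds  float64        // total time spent serving requests
}

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps next and records each request, its status, and its duration.
func (s *redStats) instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		elapsed := time.Since(start).Seconds()

		key := fmt.Sprintf("%s %s %d", req.Method, req.URL.Path, rec.status)
		s.mu.Lock()
		s.requests[key]++
		s.seconds += elapsed
		s.mu.Unlock()
	})
}

// runDemo sends three in-process requests and returns the recorded counts.
func runDemo() map[string]int {
	stats := &redStats{requests: map[string]int{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/boom", func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w, "failed", http.StatusInternalServerError)
	})
	h := stats.instrument(mux)
	for _, p := range []string{"/healthz", "/healthz", "/boom"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, p, nil))
	}
	return stats.requests
}

func main() {
	fmt.Println(runDemo())
}
```

Rate and Errors come from the same per-status counts, which mirrors the "same counter, filtered" idea in the list above; Duration is the only signal that needs a separate measurement.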
kube-prometheus-stack
kube-prometheus-stack is the standard Helm chart that installs everything at once:
- Prometheus server + Alertmanager
- Grafana with pre-built K8s dashboards
- Prometheus Operator: watches for ServiceMonitor resources (a CRD) and auto-configures scrape jobs
You install it once per cluster. To tell Prometheus to scrape your service, you create a ServiceMonitor alongside the Service — Prometheus Operator picks it up and adds it to the scrape config automatically. You never edit prometheus.yml by hand.
What good "instrument my service" looks like
For your module-3 Go app to be scrapable, you need to:
- Import a Prometheus client library (github.com/prometheus/client_golang/prometheus/promhttp).
- Expose /metrics on your HTTP server.
- Add a ServiceMonitor to the Helm chart so Prometheus knows to scrape it.
That's it. The task walks through every step.
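As a rough picture of the "expose /metrics" step, the sketch below serves a hand-written counter next to an ordinary route. It deliberately avoids third-party modules: metricsHandler stands in for the handler that the promhttp package provides, and the app type, the get helper, and demo are hypothetical. Requests are driven in-process through httptest rather than a real listener, so treat it as an illustration of the shape, not the code the task will have you write.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// app holds a hypothetical hand-rolled counter; with a client library
// you would register a counter metric instead.
type app struct{ requestsTotal atomic.Int64 }

// metricsHandler writes the counter in the Prometheus text format.
// It is a stand-in for the client library's handler.
func (a *app) metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP http_requests_total Total number of HTTP requests")
	fmt.Fprintln(w, "# TYPE http_requests_total counter")
	fmt.Fprintf(w, "http_requests_total %d\n", a.requestsTotal.Load())
}

// mux wires the application route and the /metrics route together.
func (a *app) mux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, _ *http.Request) {
		a.requestsTotal.Add(1)
		fmt.Fprintln(w, "hello")
	})
	mux.HandleFunc("/metrics", a.metricsHandler)
	return mux
}

// get sends one in-process GET to the handler and returns the body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

// demo serves one ordinary request, then reads /metrics once,
// roughly what a single scrape would see.
func demo() string {
	a := &app{}
	h := a.mux()
	get(h, "/")
	return get(h, "/metrics")
}

func main() {
	fmt.Print(demo())
}
```

In a running service the same mux would be passed to http.ListenAndServe, and the ServiceMonitor would point Prometheus at the port that serves it.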
Relevant links
- PromQL cheat sheet — keep it bookmarked.
- Grafana Dashboards library — don't build dashboards from scratch; import and customise.
- OpenTelemetry — the vendor-neutral instrumentation standard. Where you'd go next after metrics.