Lesson 17 of 28

Module 5 · Concepts — Metrics, logs, and the three questions observability answers

The three questions

Observability isn't a tool — it's the ability to answer, at 3am on a Saturday:

Is the service up? (healthy/unhealthy — the simplest question)
Is the service slow or erroring? (the RED method: Rate, Errors, Duration)
Why? (drill down: which endpoint, which pod, which dependency?)

Tools exist to answer each question. As a DevOps/Platform engineer, you're expected to know the stack that answers them, at least at a surface level.

Metrics vs logs vs traces

Metrics — numbers over time. Cheap to store, good for aggregation (request_rate_total, cpu_utilization), bad for the question "what exactly happened for user X at 14:37?"
Logs — events with timestamps. Rich in detail, expensive to store, good for retracing a single request.
Traces — a structured record of one request as it moves through many services. Essential in distributed systems; overkill for a single-service setup.

This module focuses on metrics. In a real job you'd add logs with a stack like Loki or Elasticsearch, and traces with OpenTelemetry plus a backend like Jaeger or Tempo. The reasoning is the same for all three; you just pick different storage.

Prometheus in 90 seconds

Prometheus is a time-series database that pulls metrics from HTTP endpoints your services expose. It scrapes /metrics on every target at a fixed interval (e.g. every 15 seconds) and stores each sample.

A metrics endpoint response looks like:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/healthz",status="200"} 42
http_requests_total{method="GET",path="/",status="200"} 8

Each non-comment line is one time series: a metric name, a set of labels (method, path, status), and its current value. The labels are the powerful bit because they let you slice. For example, "errors per path" is sum by (path) (rate(http_requests_total{status=~"5.."}[5m])).
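To make the line format concrete, here is a minimal sketch of how one sample line could be rendered by hand. The function name formatSample is made up for this example; in practice a client library produces this text for you, so treat it as an illustration of the format rather than something you'd ship.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes the characters that must be escaped inside
// label values in the text format: backslash, double quote, newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample (hypothetical helper) renders one sample line:
//   name{k1="v1",k2="v2"} value
// Label names are sorted only so the output is deterministic.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}

	line := name
	if len(pairs) > 0 {
		line += "{" + strings.Join(pairs, ",") + "}"
	}
	return line + " " + strconv.FormatFloat(value, 'g', -1, 64)
}

func main() {
	labels := map[string]string{"method": "GET", "path": "/healthz", "status": "200"}
	fmt.Println(formatSample("http_requests_total", labels, 42))
}
```

Every distinct combination of label values is its own time series, which is why labels with unbounded values (user IDs, full URLs) get expensive quickly.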

That query language is PromQL. You don't need to memorise it — bookmark the cheat sheet and copy patterns.
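If rate(...[5m]) feels opaque, the arithmetic behind it is roughly "how much did the counter grow over the window, divided by how many seconds passed". The sketch below is a simplified approximation, not Prometheus's actual implementation: it skips the extrapolation Prometheus applies at the window edges and returns 0 where Prometheus would return no result. The sample type and perSecondRate name are invented for the example.

```go
package main

import "fmt"

// sample is one scraped counter value (hypothetical type for this sketch).
type sample struct {
	unixSeconds float64
	value       float64
}

// perSecondRate approximates what rate() computes: the total increase of a
// counter across the window divided by the elapsed seconds.
func perSecondRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0 // not enough data to compute a rate
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].value - samples[i-1].value
		if delta < 0 {
			// The counter went down, so the process restarted and the
			// counter began again from 0. Count the new value as the increase.
			delta = samples[i].value
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].unixSeconds - samples[0].unixSeconds
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// Four scrapes, 15 seconds apart, 30 new requests between each.
	window := []sample{{0, 100}, {15, 130}, {30, 160}, {45, 190}}
	fmt.Println(perSecondRate(window)) // 90 requests over 45 seconds
}
```

The reset handling is the reason counters are wrapped in rate() before being summed or graphed: the raw value drops to 0 on every restart, but the rate stays meaningful.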

Four metric types

Counter — a number that only goes up (or resets to 0). Request counts, bytes processed.
Gauge — a number that can go up and down. Queue depth, memory in use.
Histogram — a distribution. Request latency is almost always a histogram.
Summary — also a distribution, but calculated client-side. Less commonly used; prefer histograms.
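To see how the first three types differ in behaviour, here is a rough standard-library sketch. The Counter, Gauge, and Histogram types below are hypothetical stand-ins written for this example, not the client library's types; the one detail worth noticing is that Prometheus histogram buckets are cumulative, so each bucket counts every observation at or below its upper bound.

```go
package main

import (
	"fmt"
	"math"
)

// Counter only ever increases; this sketch ignores negative deltas.
type Counter struct{ value float64 }

func (c *Counter) Add(delta float64) {
	if delta < 0 {
		return
	}
	c.value += delta
}

// Gauge can be set to any value and moved in either direction.
type Gauge struct{ value float64 }

func (g *Gauge) Set(v float64)     { g.value = v }
func (g *Gauge) Add(delta float64) { g.value += delta }

// Histogram counts observations into cumulative buckets and also
// tracks the running sum and total count.
type Histogram struct {
	upperBounds []float64 // ascending; the last bound is +Inf
	counts      []uint64  // counts[i] = observations <= upperBounds[i]
	sum         float64
	count       uint64
}

func NewHistogram(bounds ...float64) *Histogram {
	b := append(append([]float64{}, bounds...), math.Inf(1))
	return &Histogram{upperBounds: b, counts: make([]uint64, len(b))}
}

func (h *Histogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++ // cumulative: every bucket at or above v is bumped
		}
	}
	h.sum += v
	h.count++
}

func main() {
	latency := NewHistogram(0.1, 0.5, 1) // bucket bounds in seconds
	for _, seconds := range []float64{0.05, 0.3, 2} {
		latency.Observe(seconds)
	}
	fmt.Println(latency.upperBounds, latency.counts)
}
```

Because buckets are plain counters, they can be summed across pods before a percentile is estimated, which is the practical reason to prefer histograms over summaries.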

The RED method says to instrument every service with three metrics:

Rate — http_requests_total (counter)
Errors — http_requests_total{status=~"5.."} (same counter, filtered)
Duration — http_request_duration_seconds (histogram)

Learn RED, ship RED, and your service is better instrumented than most.
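One common way to collect all three signals in one place is an HTTP middleware. The sketch below uses only the standard library and keeps plain in-memory totals; redStats, statusRecorder, and instrument are names invented for this example, and a real service would record these through a Prometheus client library (with a histogram for duration) instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// redStats holds the three RED signals for one service (hypothetical type).
type redStats struct {
	mu            sync.Mutex
	requests      uint64        // Rate: total requests seen
	errors        uint64        // Errors: responses with a 5xx status
	totalDuration time.Duration // Duration: summed latency, a stand-in for a histogram
}

// statusRecorder remembers the status code the wrapped handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps next and updates stats after every request.
func instrument(stats *redStats, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// Default to 200 because handlers that never call WriteHeader send 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		elapsed := time.Since(start)

		stats.mu.Lock()
		defer stats.mu.Unlock()
		stats.requests++
		if rec.status >= 500 {
			stats.errors++
		}
		stats.totalDuration += elapsed
	})
}

// demo sends one successful and one failing request through the middleware.
func demo() *redStats {
	stats := &redStats{}
	handler := instrument(stats, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/boom" {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	for _, path := range []string{"/", "/boom"} {
		handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	}
	return stats
}

func main() {
	s := demo()
	fmt.Printf("requests=%d errors=%d\n", s.requests, s.errors)
}
```

Wrapping the handler once at the router level means every endpoint gets RED coverage without each handler having to remember to record anything.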

kube-prometheus-stack

kube-prometheus-stack is the standard Helm chart that installs everything at once:

Prometheus server + Alertmanager
Grafana with pre-built K8s dashboards
Prometheus Operator, which watches for ServiceMonitor custom resources and auto-configures scrape jobs

You install it once per cluster. To tell Prometheus to scrape your service, you create a ServiceMonitor alongside the Service — Prometheus Operator picks it up and adds it to the scrape config automatically. You never edit prometheus.yml by hand.

What good "instrument my service" looks like

For your module-3 Go app to be scrapable, you need to:

Import the Prometheus Go client library (github.com/prometheus/client_golang; its prometheus/promhttp package provides the /metrics handler).
Expose /metrics on your HTTP server.
Add a ServiceMonitor to the Helm chart so Prometheus knows to scrape it.

That's it. The task walks through every step.
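As a preview of the "expose /metrics" step, here is a dependency-free sketch that serves a single counter in the text format shown earlier. It is a hand-rolled stand-in so it runs without any modules: with client_golang you would register promhttp.Handler() on /metrics instead of writing the handler yourself. The newMux and scrape helpers are invented for this example, and the in-process scrape only imitates what Prometheus does over the network.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newMux builds a router with an application endpoint and a /metrics
// endpoint. The counter is a hand-rolled stand-in for a library metric.
func newMux() *http.ServeMux {
	var requestsTotal uint64

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		atomic.AddUint64(&requestsTotal, 1)
		fmt.Fprintln(w, "hello")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Content type used by the Prometheus text exposition format.
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "# HELP http_requests_total Total number of HTTP requests")
		fmt.Fprintln(w, "# TYPE http_requests_total counter")
		fmt.Fprintf(w, "http_requests_total %d\n", atomic.LoadUint64(&requestsTotal))
	})
	return mux
}

// scrape performs one in-process GET against path and returns the body.
func scrape(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	scrape(mux, "/") // two application requests
	scrape(mux, "/")
	fmt.Print(scrape(mux, "/metrics"))
}
```

Note that hitting /metrics does not increment the request counter here, since the scrape endpoint has its own handler; whether to count scrapes as traffic is a choice you make when wiring up the real middleware.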

Relevant links

PromQL cheat sheet — keep it bookmarked.
Grafana Dashboards library — don't build dashboards from scratch; import and customise.
OpenTelemetry — the vendor-neutral instrumentation standard. Where you'd go next after metrics.
