learnclaude .dev

Lesson 9 of 16

Local Gemma adapter via Ollama

You have the Client interface and the Fake adapter from the previous lesson. Now you implement the first real adapter — a local one. No API key, no per-token cost, no network round-trip outside your laptop. The contract is exactly the same.

Why a local adapter matters

A cloud API is the obvious place to start for production fidelity, but a local model earns its keep in three specific ways:

  • Free integration tests. The cancellation probe from lesson 08 ran against a Fake. The probes in this lesson run against a real model, and they can run on every dev-laptop test run, and on any CI runner that has Ollama available, without costing anything per call. Because the suite is free, you can run it far more often, which catches drift far earlier.
  • Data control. Some workloads can't send their inputs to a third-party API — regulated tenant data, internal infra logs, customer PII. A local adapter is the only credible option for those, and the operator gets to make that decision at wiring time without touching the reconciler.
  • Provider-outage resilience. When a cloud provider 5xx's, the operator can fall back to a local model instead of failing the reconcile. Smaller models are worse, but a worse answer is usually better than no answer when you're trying to keep a system running.

Trade-offs are real: small local models follow instructions less reliably than Sonnet, the wall-clock is slower (seconds, not hundreds of ms), and you're bounded by laptop RAM. The interface absorbs all of that — the reconciler doesn't care; only the wiring does.

Install Ollama and pull a Gemma model

Ollama is the easiest way to run an open model locally on macOS. It's a small daemon that exposes an HTTP API at localhost:11434.

brew install ollama
ollama serve &                # leave running, or run the macOS .app
ollama pull gemma4:e2b        # ~7.2GB on disk; needs 16GB Mac RAM comfortably
ollama list                   # confirm the tag you actually got

gemma4:e2b is the safe small default for a 16GB Mac. On Apple Silicon with 32GB+, gemma4:e4b (~9.6GB on disk) is noticeably more capable. On a workstation with a discrete GPU or 64GB+ unified memory, gemma4:26b (Mixture-of-Experts, 18GB) or gemma4:31b (Dense, 20GB) deliver near-Sonnet-quality reasoning. The e-prefixed variants are Gemma 4's small/edge tier (E2B / E4B); the 26B is the MoE architecture with ~3.8B active params; 31B is the full dense model.

Drift risk. Gemma model tags change. The tag in this lesson may not match what's available when you run ollama pull. Run ollama list first, pick whichever Gemma is current, and substitute the name in every command below. The adapter you build doesn't care about the tag — it's a string passed into the request — so swapping is a config change, not a code change.

Quick smoke-test that the install is wired up:

curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Say only the word OK.",
  "stream": false
}' | head

If you get a JSON object back with a "response" field, you're ready. If curl errors with Connection refused, ollama serve isn't running. Fix that first; it is the exact failure mode the adversarial probe at the end of this lesson targets.

The Ollama HTTP API

The endpoint you'll call is POST /api/generate. The relevant request fields:

{
  "model":   "gemma4:e2b",
  "prompt":  "...",
  "system":  "optional system prompt",
  "stream":  false,
  "options": {
    "temperature": 0.2,
    "num_predict": 256
  }
}

Set stream: false for single-shot use. Ollama streams JSON chunks by default, which we don't want for the reconciler's request/response shape.

The response (stream: false) returns one JSON object:

{
  "model":               "gemma4:e2b",
  "response":            "OK",
  "done":                true,
  "prompt_eval_count":   8,
  "eval_count":          1,
  "total_duration":      812345678
}

Two fields matter for our Response mapping: prompt_eval_count → InputTokens and eval_count → OutputTokens. total_duration is in nanoseconds, but we'll measure latency ourselves with time.Now() since Ollama's number doesn't include client-side overhead.

Claude-guided task — implement the adapter

Open Claude Code in the operator-llm repo from lesson 08. Drive the work — but read every file before running anything. Don't reach for an Ollama Go client library: writing the HTTP call by hand teaches you what the contract actually requires.

What to ask Claude for, in order:

  1. Create pkg/llm/ollama/adapter.go with an Adapter struct holding Endpoint string (default http://localhost:11434), Model string (default gemma4:e2b), and HTTPClient *http.Client (default &http.Client{}). The constructor New(opts Options) *Adapter should fall back to env vars OLLAMA_HOST and OLLAMA_MODEL when fields are empty, so wiring at the operator level can be a one-liner.

  2. Implement Ask(ctx, req) (Response, error). It builds the JSON body, calls http.NewRequestWithContext(ctx, ...) (the WithContext is non-negotiable — that's how cancellation gets honoured), reads the response, maps prompt_eval_count / eval_count into the typed response, and sets Provider: "ollama" and LatencyMs from a time.Since(start).Milliseconds().

  3. Map errors into ProviderError. This is the table the adapter must honour:

    Ollama signal                                             ProviderError
    --------------------------------------------------------  ------------------------------------------------------
    net.OpError with connection refused (daemon not running)  Code: "unavailable", Retryable: true
    HTTP 404 + body mentions model not found                  Code: "not_found", Retryable: false
    HTTP 500 + body contains out of memory                    Code: "unavailable", Retryable: true
    HTTP 5xx any other                                        Code: "unavailable", Retryable: true
    HTTP 4xx other than 404                                   Code: "bad_request", Retryable: false
    ctx.Err() == context.DeadlineExceeded                     Code: "timeout", Retryable: true
    ctx.Err() == context.Canceled                             return context.Canceled directly (not a ProviderError)

    The classifier is the entire point of this adapter — it's where Ollama's HTTP quirks get translated into the provider-agnostic codes the caller's retry policy already understands.

  4. Wire the live test in pkg/llm/ollama/adapter_test.go. Skip it if OLLAMA_HOST isn't reachable so CI doesn't need Ollama running:

    func TestAdapter_Live(t *testing.T) {
        if !pingOllama(t) {
            t.Skip("Ollama not running")
        }
        a := ollama.New(ollama.Options{})
        resp, err := a.Ask(context.Background(), llm.Request{
            Prompt:    "Reply with exactly: OK",
            MaxTokens: 8,
        })
        if err != nil {
            t.Fatal(err)
        }
        if resp.Text == "" {
            t.Fatal("empty response")
        }
        if resp.InputTokens == 0 {
            t.Fatal("missing token count")
        }
    }
    

Two questions worth asking Claude before you accept its draft:

  • "Why is http.NewRequestWithContext required here? What happens if I use http.NewRequest instead?" If Claude's answer doesn't connect to the cancellation probe below, the cancellation handling isn't yet your own.
  • "The classifier returns context.Canceled directly rather than wrapping it in a ProviderError. Why? Should I be consistent and wrap everything?" The answer is: cancellation isn't a provider failure, it's a caller decision. The retry policy from lesson 08 doesn't retry on context.Canceled because the caller asked for the call to stop. Wrapping it would lose that signal.

Adversarial probe — Ollama offline

Stop the daemon: osascript -e 'quit app "Ollama"' (or pkill ollama). Run the offline probe below (the live test skips itself when the daemon is down). The call to Ask must:

  1. Return a *llm.ProviderError (not a raw *url.Error, not a panic, not a hang).
  2. Set Code: "unavailable" and Retryable: true.
  3. Return within the context deadline (set the test's ctx to 3s — even though Ollama's down, the call must fail-fast, not wait for a TCP retransmit storm).

Write it explicitly:

func TestAdapter_OllamaOffline(t *testing.T) {
    if pingOllama(t) {
        t.Skip("Ollama IS running: kill it first")
    }
    a := ollama.New(ollama.Options{})
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    start := time.Now()
    _, err := a.Ask(ctx, llm.Request{Prompt: "anything"})
    // Requirement 3: fail fast. Allow a little slack over the 3s deadline.
    if elapsed := time.Since(start); elapsed > 3500*time.Millisecond {
        t.Fatalf("Ask took %v, want it to return within the 3s deadline", elapsed)
    }
    var pe *llm.ProviderError
    if !errors.As(err, &pe) {
        t.Fatalf("want *ProviderError, got %T: %v", err, err)
    }
    if pe.Code != "unavailable" || !pe.Retryable {
        t.Fatalf("want unavailable+retryable, got %s retryable=%v", pe.Code, pe.Retryable)
    }
}

If the first run returns a raw *url.Error (Go's default for refused connections), the classifier is missing — go fix it, re-run. If the call hangs longer than the 3s ctx, the request isn't using NewRequestWithContext — fix that, re-run.

This is the editorial point of the lesson. The contract is useful only if adapters reliably map their provider-specific failures into the codes callers depend on. The probe is what proves you've done it.

Restart Ollama and re-run the live test to confirm you didn't break the happy path while fixing the offline case.

Codify as a skill

.claude/skills/gemma-ollama-adapter/SKILL.md. Have Claude draft it, then critique and tighten it.

Capture:

  • The Adapter's three config fields and env-var fallbacks, so a future Claude session can wire it into the operator without re-reading the source.
  • The full error-mapping table verbatim — it is the adapter's contract.
  • The two-test pattern: live (skips if Ollama is unreachable) + offline (skips if Ollama IS running). These mutually exclusive skips mean the suite always exercises one of them, and a developer running them locally never sees both pass simultaneously by accident.
  • The cancellation-probe template from lesson 08, restated with time.NewTimer instead of time.Sleep so it matches the production-shaped adapter rather than the Fake.

End with the mandatory boundary statement:

This skill handles: single-prompt completions against a locally-running Ollama daemon for the Gemma model family, with classified errors, ctx-respecting cancellation, and an explicit offline-failure probe.

This skill does NOT handle: streaming responses, multi-turn chat (/api/chat), embedding endpoints, model pulling/management, custom system prompts beyond SystemHint, or other model families (Llama, Mistral, etc.) — those need their own skills because their error semantics and prompt formatting differ.

Validate fresh: new Claude Code session, hand it the skill, ask it to "add a gemma4:31b health-check function that pings the daemon and confirms the model is loaded." It should produce a function that uses the existing config + classifier, not one that bypasses them with a new HTTP call. If it bypasses them, the skill describes the pattern without making clear that every new call must go through the existing config and classifier. Tighten the prose and re-test.

Then run the adversarial validation: ask the same fresh session to "add streaming support to the adapter." The skill must refuse — that's outside its boundary, and the right answer is "build a separate streaming adapter that satisfies a different contract."

Promote deterministic commands to scripts/

Two new entries:

  • scripts/ollama-up.sh: ollama serve & if not already running; ollama pull $OLLAMA_MODEL; wait for /api/tags to respond. Idempotent.
  • scripts/ollama-down.sh: pkill ollama and verify with lsof -i :11434.

The skill's validation section calls these scripts when running the two tests — keeping the judgement (which adapter to wire, which model to pick) in prose, and the determinism (start, stop, wait) in bash, per the project rule.

Acceptance test

scripts/ollama-up.sh && go test ./pkg/llm/ollama/...           # live test passes
scripts/ollama-down.sh && go test ./pkg/llm/ollama/... -run Offline   # offline probe passes

Both green. The live test asserts non-empty Response.Text and non-zero InputTokens. The offline probe asserts ProviderError{Code: "unavailable", Retryable: true} within 3s.

If the live test passes but InputTokens == 0, the Ollama response parsing is missing the prompt_eval_count mapping; go fix it. If the offline probe takes noticeably longer than the 3s deadline, the ctx isn't being honoured at the HTTP layer; go fix it.

Coming up

Same exercise, harder problem. Lesson 10 implements the Client interface against the Anthropic Claude API — a real network round-trip, real authentication, real billing. The error-mapping table grows (auth, rate-limit, overload) and the adversarial probe shifts from "daemon dead" to "key bogus." If the contract from lesson 08 was designed well, the adapter lands in the same shape as this one. If it wasn't — that's how you find out.
