Lesson 9 of 16
Local Gemma adapter via Ollama
You have the Client interface and the Fake adapter from the previous lesson. Now you implement the first real adapter — a local one. No API key, no per-token cost, no network round-trip outside your laptop. The contract is exactly the same.
Why a local adapter matters
A cloud API is the obvious place to start for production fidelity, but a local model earns its keep in three specific ways:
- Free integration tests. The cancellation probe from lesson 08 ran against a Fake. The probes in this lesson run against a real model, and they can run on every CI build and every dev-laptop test without burning cents. That cost-discipline lets you run the suite far more often, which catches drift far earlier.
- Data control. Some workloads can't send their inputs to a third-party API — regulated tenant data, internal infra logs, customer PII. A local adapter is the only credible option for those, and the operator gets to make that decision at wiring time without touching the reconciler.
- Provider-outage resilience. When a cloud provider 5xx's, the operator can fall back to a local model instead of failing the reconcile. Smaller models are worse, but a worse answer is usually better than no answer when you're trying to keep a system running.
Trade-offs are real: small local models follow instructions less reliably than Sonnet, the wall-clock is slower (seconds, not hundreds of ms), and you're bounded by laptop RAM. The interface absorbs all of that — the reconciler doesn't care; only the wiring does.
Install Ollama and pull a Gemma model
Ollama is the easiest way to run an open model locally on macOS. It's a small daemon that exposes an HTTP API at localhost:11434.
brew install ollama
ollama serve & # leave running, or run the macOS .app
ollama pull gemma4:e2b # ~7.2GB on disk; needs 16GB Mac RAM comfortably
ollama list # confirm the tag you actually got
gemma4:e2b is the safe small default for a 16GB Mac. On Apple Silicon with 32GB+, gemma4:e4b (~9.6GB on disk) is noticeably more capable. On a workstation with a discrete GPU or 64GB+ unified memory, gemma4:26b (Mixture-of-Experts, 18GB) or gemma4:31b (Dense, 20GB) deliver near-Sonnet-quality reasoning. The e-prefixed variants are Gemma 4's small/edge tier (E2B / E4B); the 26B is the MoE architecture with ~3.8B active params; 31B is the full dense model.
Drift risk. Gemma model tags change. The tag in this lesson may not match what's available when you run ollama pull. Run ollama list first, pick whichever Gemma is current, and substitute the name in every command below. The adapter you build doesn't care about the tag: it's a string passed into the request, so swapping is a config change, not a code change.
Quick smoke-test that the install is wired up:
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Say only the word OK.",
  "stream": false
}' | head
If you get a JSON object back with a "response" field, you're ready. If curl errors with Connection refused, ollama serve isn't running. Fix that first: it is the exact failure mode the adversarial probe at the end of this lesson targets.
The Ollama HTTP API
The endpoint you'll call is POST /api/generate. The relevant request fields:
{
  "model": "gemma4:e2b",
  "prompt": "...",
  "system": "optional system prompt",
  "stream": false,
  "options": {
    "temperature": 0.2,
    "num_predict": 256
  }
}
Set stream: false for single-shot use. Left unset, Ollama streams a sequence of JSON chunks, which we don't want for the reconciler's request/response shape.
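A minimal sketch of how the request body above could be built in Go. The type names generateRequest and generateOptions and the helper buildBody are illustrative assumptions, not names from the lesson repo; only the JSON field names come from the request shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest mirrors the /api/generate fields listed above.
// The Go type and field names are hypothetical.
type generateRequest struct {
	Model   string          `json:"model"`
	Prompt  string          `json:"prompt"`
	System  string          `json:"system,omitempty"`
	Stream  bool            `json:"stream"` // no omitempty: false must be sent explicitly
	Options generateOptions `json:"options"`
}

type generateOptions struct {
	Temperature float64 `json:"temperature"`
	NumPredict  int     `json:"num_predict"`
}

// buildBody serialises a single-shot request with streaming disabled.
// The fixed temperature of 0.2 is taken from the example above.
func buildBody(model, prompt string, maxTokens int) ([]byte, error) {
	return json.Marshal(generateRequest{
		Model:   model,
		Prompt:  prompt,
		Stream:  false,
		Options: generateOptions{Temperature: 0.2, NumPredict: maxTokens},
	})
}

func main() {
	b, err := buildBody("gemma4:e2b", "Say only the word OK.", 256)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The one detail worth noticing is the missing omitempty on Stream: with omitempty, a false value would be dropped from the JSON and the daemon would stream.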
The response (stream: false) returns one JSON object:
{
  "model": "gemma4:e2b",
  "response": "OK",
  "done": true,
  "prompt_eval_count": 8,
  "eval_count": 1,
  "total_duration": 812345678
}
Two fields matter for our Response mapping: prompt_eval_count → InputTokens, eval_count → OutputTokens. total_duration is in nanoseconds, but we'll measure latency ourselves with time.Now() since Ollama's number doesn't include client-side overhead.
Claude-guided task — implement the adapter
Open Claude Code in the operator-llm repo from lesson 08. Drive the work — but read every file before running anything. Don't reach for an Ollama Go client library: writing the HTTP call by hand teaches you what the contract actually requires.
What to ask Claude for, in order:
1. Create pkg/llm/ollama/adapter.go with an Adapter struct holding Endpoint string (default http://localhost:11434), Model string (default gemma4:e2b), and HTTPClient *http.Client (default &http.Client{}). The constructor New(opts Options) *Adapter should fall back to env vars OLLAMA_HOST and OLLAMA_MODEL when fields are empty, so wiring at the operator level can be a one-liner.

2. Implement Ask(ctx, req) (Response, error). It builds the JSON body, calls http.NewRequestWithContext(ctx, ...) (the WithContext is non-negotiable: that's how cancellation gets honoured), reads the response, maps prompt_eval_count / eval_count into the typed response, and sets Provider: "ollama" and LatencyMs from a time.Since(start).Milliseconds().

3. Map errors into ProviderError. This is the table the adapter must honour:

   | Ollama signal | ProviderError |
   | --- | --- |
   | net.OpError with connection refused (daemon not running) | Code: "unavailable", Retryable: true |
   | HTTP 404 + body mentions model not found | Code: "not_found", Retryable: false |
   | HTTP 500 + body contains out of memory | Code: "unavailable", Retryable: true |
   | HTTP 5xx any other | Code: "unavailable", Retryable: true |
   | HTTP 4xx other than 404 | Code: "bad_request", Retryable: false |
   | ctx.Err() == context.DeadlineExceeded | Code: "timeout", Retryable: true |
   | ctx.Err() == context.Canceled | return context.Canceled directly (not a ProviderError) |

   The classifier is the entire point of this adapter: it's where Ollama's HTTP quirks get translated into the provider-agnostic codes the caller's retry policy already understands.

4. Wire the live test in pkg/llm/ollama/adapter_test.go. Skip it if OLLAMA_HOST isn't reachable so CI doesn't need Ollama running:

    func TestAdapter_Live(t *testing.T) {
        if !pingOllama(t) {
            t.Skip("Ollama not running")
        }
        a := ollama.New(ollama.Options{})
        resp, err := a.Ask(context.Background(), llm.Request{
            Prompt:    "Reply with exactly: OK",
            MaxTokens: 8,
        })
        if err != nil {
            t.Fatal(err)
        }
        if resp.Text == "" {
            t.Fatal("empty response")
        }
        if resp.InputTokens == 0 {
            t.Fatal("missing token count")
        }
    }
Two questions worth asking Claude before you accept its draft:
- "Why is http.NewRequestWithContext required here? What happens if I use http.NewRequest instead?" If Claude's answer doesn't connect to the cancellation probe below, the cancellation handling isn't yet your own.
- "The classifier returns context.Canceled directly rather than wrapping it in a ProviderError. Why? Should I be consistent and wrap everything?" The answer is: cancellation isn't a provider failure, it's a caller decision. The retry policy from lesson 08 doesn't retry on context.Canceled because the caller asked for the call to stop. Wrapping it would lose that signal.
Adversarial probe — Ollama offline
Stop the daemon: osascript -e 'quit app "Ollama"' (or pkill ollama). Run the live test. It must:
- Return a *llm.ProviderError (not a raw *url.Error, not a panic, not a hang).
- Set Code: "unavailable" and Retryable: true.
- Return within the context deadline (set the test's ctx to 3s; even though Ollama's down, the call must fail fast, not wait for a TCP retransmit storm).
Write it explicitly:
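func TestAdapter_OllamaOffline(t *testing.T) {
	if pingOllama(t) {
		t.Skip("Ollama IS running — kill it first")
	}
	a := ollama.New(ollama.Options{})
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	start := time.Now()
	_, err := a.Ask(ctx, llm.Request{Prompt: "anything"})
	// Enforce the "return within the deadline" requirement; the extra
	// second of slack is an arbitrary allowance for scheduling jitter.
	if elapsed := time.Since(start); elapsed > 4*time.Second {
		t.Fatalf("call took %v, want it bounded by the 3s ctx", elapsed)
	}

	var pe *llm.ProviderError
	if !errors.As(err, &pe) {
		t.Fatalf("want *ProviderError, got %T: %v", err, err)
	}
	if pe.Code != "unavailable" || !pe.Retryable {
		t.Fatalf("want unavailable+retryable, got %s retryable=%v", pe.Code, pe.Retryable)
	}
}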
If the first run returns a raw *url.Error (Go's default for refused connections), the classifier is missing — go fix it, re-run. If the call hangs longer than the 3s ctx, the request isn't using NewRequestWithContext — fix that, re-run.
This is the editorial point of the lesson. The contract is useful only if adapters reliably map their provider-specific failures into the codes callers depend on. The probe is what proves you've done it.
Restart Ollama and re-run the live test to confirm you didn't break the happy path while fixing the offline case.
Codify as a skill
.claude/skills/gemma-ollama-adapter/SKILL.md. Have Claude draft it, then critique and tighten it.
Capture:
- The Adapter's three config fields and env-var fallbacks, so a future Claude session can wire it into the operator without re-reading the source.
- The full error-mapping table verbatim — it is the adapter's contract.
- The two-test pattern: live (skips if Ollama is unreachable) + offline (skips if Ollama IS running). These exclusive skips mean the suite always exercises one of them, and a developer running them locally never sees both pass simultaneously by accident.
- The cancellation-probe template from lesson 08, restated with time.NewTimer instead of time.Sleep so it matches the production-shaped adapter rather than the Fake.
End with the mandatory boundary statement:
This skill handles: single-prompt completions against a locally-running Ollama daemon for the Gemma model family, with classified errors, ctx-respecting cancellation, and an explicit offline-failure probe.
This skill does NOT handle: streaming responses, multi-turn chat (/api/chat), embedding endpoints, model pulling/management, custom system prompts beyond SystemHint, or other model families (Llama, Mistral, etc.); those need their own skills because their error semantics and prompt formatting differ.
Validate fresh: new Claude Code session, hand it the skill, ask it to "add a gemma4:31b health-check function that pings the daemon and confirms the model is loaded." It should produce a function that uses the existing config + classifier, not one that bypasses them with a new HTTP call. If it bypasses, the skill is leaking the right pattern at the wrong abstraction — tighten the prose and re-test.
Then run the adversarial validation: ask the same fresh session to "add streaming support to the adapter." The skill must refuse — that's outside its boundary, and the right answer is "build a separate streaming adapter that satisfies a different contract."
Promote deterministic commands to scripts/
Two new entries:
- scripts/ollama-up.sh: ollama serve & if not already running; ollama pull $OLLAMA_MODEL; wait for /api/tags to respond. Idempotent.
- scripts/ollama-down.sh: pkill ollama and verify with lsof -i :11434.
The skill's validation section calls these scripts when running the two tests — keeping the judgement (which adapter to wire, which model to pick) in prose, and the determinism (start, stop, wait) in bash, per the project rule.
Acceptance test
scripts/ollama-up.sh && go test ./pkg/llm/ollama/... # live test passes
scripts/ollama-down.sh && go test ./pkg/llm/ollama/... -run Offline # offline probe passes
Both green. The live test asserts non-empty Response.Text and non-zero InputTokens. The offline probe asserts ProviderError{Code: "unavailable", Retryable: true} within 3s.
If the live test passes but InputTokens == 0, the Ollama response parsing is missing the prompt_eval_count mapping — go fix it. If the offline probe passes but takes >5s, the ctx isn't being honoured at the HTTP layer — go fix it.
Coming up
Same exercise, harder problem. Lesson 10 implements the Client interface against the Anthropic Claude API — a real network round-trip, real authentication, real billing. The error-mapping table grows (auth, rate-limit, overload) and the adversarial probe shifts from "daemon dead" to "key bogus." If the contract from lesson 08 was designed well, the adapter lands in the same shape as this one. If it wasn't — that's how you find out.