Design the LLM interface contract

You have a cluster, a mesh, and a sample app to break on purpose. Before we write the operator's controller, we need a reliable way for the operator's reconcile loop to ask an LLM a question and get a structured answer back — without hard-wiring the operator to a single provider's SDK.

This lesson designs that contract. The next two lessons implement it against two genuinely different backends (local Ollama, then the Anthropic Claude API), and a later module wires the result into the reconcile loop.

Why an interface, not a direct SDK call

In a toy demo, calling a provider's SDK directly from your reconcile code looks fine. Two months in, it isn't:

Provider availability. When your chosen provider 5xx's at scale, the operator can't reconcile. A pluggable client lets you fall back to a second provider, or to a local model, without redeploying the operator's image.
Cost gating per tenant. Production operators often need to route different namespaces or CRs to different providers — a paid customer hits Sonnet, an internal smoke test hits a free local model. That decision belongs at the wiring layer, not in the reconciler.
Tests stay free and deterministic. A reconciler with a hardcoded provider call can't be tested without network access, API keys, or recorded fixtures. A reconciler that depends on an interface can be tested with a deterministic stub that returns whatever the test needs.
Future swap. The provider you'd pick today is unlikely to be the right pick in twelve months. An interface is the cheapest way to keep the option open.

The pattern looks like over-engineering until the first time you need any of those four things, at which point it's the cheapest hour you ever spent. Production AI integrations commonly end up built this way: the SDK call is the implementation detail, not the API.

The contract

Here is the interface this module produces. Read it first — every design choice in this lesson is justified by something it does or refuses to do:

package llm

import "context"

type Client interface {
    Ask(ctx context.Context, req Request) (Response, error)
}

type Request struct {
    Prompt      string
    SystemHint  string
    MaxTokens   int
    Temperature float64
    JSONSchema  any // optional: structured-output hint, provider-specific encoding
}

type Response struct {
    Text         string
    Provider     string
    LatencyMs    int64
    InputTokens  int
    OutputTokens int
}

type ProviderError struct {
    Code      string // "rate_limit" | "timeout" | "auth" | "unavailable" | "bad_request" | "not_found"
    Retryable bool
    Wrapped   error
}

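// Error is nil-safe: adapters and fakes may build a ProviderError without a wrapped cause.
func (e *ProviderError) Error() string {
    if e.Wrapped == nil {
        return e.Code
    }
    return e.Code + ": " + e.Wrapped.Error()
}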
func (e *ProviderError) Unwrap() error { return e.Wrapped }

Four design decisions sit inside that code:

ctx is the only timeout knob. Callers set deadlines on the context (ctx, cancel := context.WithTimeout(parent, 30*time.Second)). Adapters must honour cancellation — no hidden SDK-internal timeouts the caller can't override. The reconciler will set tight deadlines so a hung LLM call doesn't wedge the worker; the contract makes that possible.
Errors are classified by the adapter, not the caller. A ProviderError tells the caller what kind of failure happened and whether to retry. The caller's retry policy can be one piece of code that works across providers, because the classifier sits next to the provider's quirks (Anthropic's 529 means "overloaded, retry"; Ollama's connection refused means "process not running, retry"; both map to Code: "unavailable", Retryable: true).
JSONSchema is a hint, not a guarantee. Text is always the raw model output. Callers that need structured output unmarshal at the call site. Different providers have different (and rapidly evolving) structured-output APIs; promising a typed result in the interface would either lie or pin us to whichever provider has the weakest support.
Token usage rides on the response. Cost tracking and rate-limit budgeting belong in caller code — but only if the caller knows how many tokens each call cost. Returning input/output token counts in Response keeps that information available without forcing provider-specific accounting elsewhere.

The interface is intentionally small: one method, five request fields (four primitives plus an optional schema hint), and five primitive response fields. Anything provider-specific that can't be expressed here (streaming, tool use, multi-turn chat, prompt caching) is not part of this contract. It belongs in a sibling interface or a different abstraction altogether.

Claude-guided task — scaffold the module and a Fake adapter

You'll build this in a new repo, not inside the operator project (which doesn't exist yet). Treat the module as a standalone Go library — it should be usable from any service that needs an LLM call, not just K8s operators.

mkdir ~/operator-llm && cd ~/operator-llm
go mod init github.com/<your-handle>/operator-llm

From inside that directory, start Claude Code. Drive the work through it — but read every file Claude writes before you run anything. The point of this lesson is the design, not the typing.

What to ask Claude for, in order:

Scaffold the llm package in pkg/llm/ with the Client interface, Request, Response, and ProviderError types exactly as shown above. Ask Claude to add idiomatic doc comments on each — read them. If a comment says something the interface doesn't actually enforce, push back.
Build a Fake adapter in pkg/llm/fake/. It satisfies Client and returns canned responses keyed on prompt prefix (e.g. prompts starting "echo: " return the suffix; prompts starting "fail:rate_limit" return a &llm.ProviderError{Code: "rate_limit", Retryable: true}, a pointer because Error() has a pointer receiver; an unmatched prompt returns a generic stub). Give its config a time.Duration latency field (the probe below calls it Latency) so tests can simulate slow models. The Fake is what the eventual reconciler unit tests will use, so it is not throwaway code.
Add a RetryPolicy helper in pkg/llm/retry/. It takes a Client, wraps Ask, and retries up to N times with exponential backoff, but only when ProviderError.Retryable is true. The retry policy is a separate type so adapters don't reinvent it.
Table-driven tests in pkg/llm/fake/fake_test.go covering: echo case, error case (with retry policy asserting one attempt for non-retryable), context cancellation (next section).

Before running anything: read the generated fake.go and retry.go end to end. Two specific questions worth asking Claude about its choices:

"How does the Fake's time.Sleep interact with ctx.Done()? If a test cancels the context mid-sleep, does the call return immediately or wait the full duration?" If the answer is "wait the full duration", that's a bug you'll catch in the adversarial probe below.
"Why does Error() on ProviderError use a pointer receiver? What changes if I make it a value receiver?" If Claude can't articulate the answer, the design isn't yet your own.

Project rule. Stay close to your problem. AI acceleration is never a license to disengage; the lesson's value is in the four design decisions above, which Claude will defer to you on if you ask. Don't accept code whose rationale you can't repeat back.

Adversarial probe — context cancellation

The interface promises that ctx controls cancellation. The Fake will probably break that promise in its first iteration, which is exactly why this test exists.

Ask Claude to write this test:

func TestFake_RespectsContextCancellation(t *testing.T) {
    f := fake.New(fake.Config{Latency: 200 * time.Millisecond})
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // already cancelled before Ask runs

    start := time.Now()
    _, err := f.Ask(ctx, llm.Request{Prompt: "echo: hi"})
    elapsed := time.Since(start)

    if !errors.Is(err, context.Canceled) {
        t.Fatalf("want context.Canceled, got %v", err)
    }
    if elapsed > 50*time.Millisecond {
        t.Fatalf("call took %v; should return immediately on cancelled ctx", elapsed)
    }
}

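Run it. If the Fake naively calls time.Sleep(200 * time.Millisecond) regardless of the context, the test fails: the call returns a nil error after roughly 200ms, so the context.Canceled assertion trips first, and the call would also miss the 50ms bound. Fix the Fake to select { case <-time.After(latency): case <-ctx.Done(): return ..., ctx.Err() }, then re-run. The test passes.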

This is the editorial point. The interface alone doesn't force adapters to respect ctx — it only declares the expectation. The test is what enforces it. Without the probe, the Fake's bug would have shipped silently and the reconciler that depended on cancellable LLM calls would have one wedged worker per stuck call.

Every adapter in the next two lessons gets the same probe in its own form.

Codify as a skill

Create .claude/skills/llm-interface-design/SKILL.md. Have Claude write the first draft, then critique and tighten it.

The skill should capture:

The four design decisions above, each as a single-sentence rule.
The Fake-adapter pattern: deterministic, prompt-prefix-keyed, configurable latency, ctx-respecting.
The retry-policy split: classification is the adapter's job, the policy is one piece of code.
The cancellation-probe test as a template — every adapter gets one.

End the SKILL.md with the mandatory boundary statement (project rule):

This skill handles: designing a single-shot request/response LLM interface for Go services, with classified retryability, ctx-respecting cancellation, and a deterministic Fake adapter.

This skill does NOT handle: streaming responses, function/tool calling, multi-turn conversation, embeddings, prompt caching. Each of those is a separate contract.

Validate the skill in a fresh Claude Code session — close this one, start a new one with no history, point it at the skill, and ask: "Design a single-shot LLM interface for a Python web service that summarises support tickets." Read the output. It should produce something shaped like your Go contract but expressed idiomatically in Python — same four decisions, different language. If it produces something Go-shaped, the skill is too implementation-coupled; tighten it and re-test.

Then run the adversarial validation: in the same fresh session, ask it to "design a streaming chat interface". The skill must refuse — that's outside its boundary. A skill that quietly drifts into out-of-scope work is worse than no skill, because it produces confident-looking code that misses the point.

Promote deterministic commands to `scripts/`

scripts/test.sh is one line — go test ./pkg/llm/... — and it lets the skill (or a future CI run) verify the contract without prose ambiguity. The skill's "validation" section should call this script, not paraphrase what it does.

Acceptance test

Run from the module root:

go test ./pkg/llm/...

Everything should be green, with at least five table-driven cases on the Fake, including the cancellation probe and at least one retryable and one non-retryable error case. The retry policy's test asserts that a non-retryable ProviderError produces exactly one attempt.

If any test is flaky or order-dependent, the Fake isn't deterministic enough — fix it before moving on.

Coming up

Two adapters next. Lesson 09 implements Client against a local Ollama server running Gemma — no API key, no per-token cost, but the same contract. Lesson 10 implements it against the Anthropic Claude API — paid, faster, larger context. Both reuse the retry policy and the cancellation probe you just built. If the contract is genuinely provider-shaped (rather than provider-locked), both adapters will land without forcing the interface to change. If one of them forces a change, you'll see exactly where the contract was wrong — that's the lesson.