Swapping in local models for Sonnet/Haiku and others

What I've found worth trying is not swapping local in for Sonnet wholesale. Instead, I pick narrow, frequent, lower-stakes pipeline steps and route those locally. I keep Sonnet/Haiku for quality-critical work, and pass the local model's output into Sonnet/Haiku as needed.

My office M1 Max (64GB) runs as a dedicated inference host over Tailscale.

Here's where I run local models and what they touch:

qwen3-coder:30b — code review pass on file diffs before I see them. ~1.65s warm latency over Tailscale.
gemma3:1b — real-time intent extraction for the memory scorer. Speed > quality; the task is keyword-classification-shaped, not reasoning.
gemma4:12b — first-pass memory tagging / extraction. Hallucinates occasionally but acceptable for frequent free batch work.
Haiku stays for compound-tag generation on writes. I benchmarked 10 samples of Haiku vs gemma4:12b: Haiku won 6, tied 4, lost 0. The cheap-but-good model owns the quality-critical step; the free local model owns the cheap, frequent one.

Two things that made local models through Ollama actually work for me:

OLLAMA_KEEP_ALIVE=-1 + OLLAMA_HOST=0.0.0.0 set persistently via a LaunchAgent. Without keep-alive, every cold call took ~14s.
Fall back to Haiku if the office machine is unreachable. Don't make a local-model dependency a hard dependency.
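
For the fallback, here's a minimal sketch of how the wiring could look, not my exact code. It assumes Ollama's HTTP /api/generate endpoint on its default port 11434. The hostname office-m1 and the fallback callable (whatever you use to call Haiku) are placeholders I made up for illustration. The per-request keep_alive of -1 asks Ollama to keep the model loaded, which complements the environment variable.

```python
import json
import urllib.request

# Hypothetical Tailscale hostname; 11434 is Ollama's default port.
OLLAMA_URL = "http://office-m1:11434/api/generate"


def call_local(prompt, model="gemma3:1b", timeout=5.0, url=OLLAMA_URL):
    """Send one non-streaming request to Ollama and return the generated text."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON object back instead of a chunk stream
        "keep_alive": -1,  # keep the model loaded after this request
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def generate(prompt, local=call_local, fallback=None):
    """Return (text, source). Try the local host first, then the fallback.

    `fallback` is a placeholder for your hosted-model call (e.g. Haiku).
    """
    try:
        return local(prompt), "local"
    except (OSError, ValueError, KeyError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError/KeyError cover a malformed or unexpected response body.
        if fallback is None:
            raise
        return fallback(prompt), "fallback"
```

Pass your hosted-model call as fallback. A short timeout keeps an unreachable office machine from stalling the whole pipeline.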

My mental model is that local models are great Haiku replacements for narrow tasks where you control the prompt and the output is structured (JSON, tags, scores). They get squishy fast when you give them Sonnet-shaped work: Qwen3.6-35B, for example, fails by emitting "!!!!!" garbage under prompts it can't hold.
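
Structured output is also what makes that failure mode cheap to catch. A rough sketch of a guard, assuming the local model is asked for a flat JSON object like {"tags": [...]} (that schema, the function name, and the max_tags limit are illustrative, not what I actually run):

```python
import json
import re

# Four or more of the same punctuation character in a row, e.g. "!!!!!".
# Assumes a flat schema, so valid output never nests deeply enough to trip this.
DEGENERATE = re.compile(r"([^\w\s])\1{3,}")


def parse_tags(raw, max_tags=10):
    """Return a cleaned list of tags, or None if the output should be rejected."""
    if DEGENERATE.search(raw):
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    tags = data.get("tags")
    if not isinstance(tags, list) or not tags or len(tags) > max_tags:
        return None
    if not all(isinstance(t, str) and t.strip() for t in tags):
        return None
    return [t.strip().lower() for t in tags]
```

A None result is the signal to rerun the step or hand it to a stronger model.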

Another thing I ran into: benchmarking Haiku vs gemma4 (or whatever model) on concrete steps in a workflow gives you a number to decide on. 6 wins, 4 ties, 0 losses is a better basis for a decision than vibes.
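
The tally itself is trivial to script. A small sketch, assuming you already have one verdict per sample from whatever judging you trust (manual review, a rubric, a stronger model). The verdict labels and the min_margin threshold are made up for illustration:

```python
from collections import Counter


def tally(verdicts):
    """Count per-sample verdicts for model A vs model B.

    Each verdict is "a" (A better), "b" (B better), or "tie".
    """
    counts = Counter(verdicts)
    unknown = set(counts) - {"a", "b", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    return {"wins": counts["a"], "ties": counts["tie"], "losses": counts["b"]}


def keep_model_a(result, min_margin=3):
    """Hypothetical decision rule: keep A if it wins by at least min_margin."""
    return result["wins"] - result["losses"] >= min_margin


# The Haiku-vs-gemma4:12b numbers from above, with Haiku as model A.
result = tally(["a"] * 6 + ["tie"] * 4)
print(result, keep_model_a(result))
```

Ten samples is a small set, so treat the margin as a tiebreaker on top of judgment rather than a statistical result.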

I think of locally hosted models as an augmentation, not a replacement for the platform models. I'm getting to the point where I start with the workflow and then incrementally hand steps down to more affordable or locally hosted models as I get a sense of which parts can be offloaded.

local = anything structured, narrow, prompt-controlled. Tags, classification, JSON extraction, intent parsing, code review on small diffs (qwen3-coder:30b handles this fine), short summaries. The failure mode is "squishy but recoverable."
deepseek = cheap reasoning at scale. Long-context, multi-step, code gen across files. The failure mode is "needs internet," and you pay something, though the cost is low. The floor of DeepSeek's capability is way above the ceiling of any local model, and the pricing is quite good.
haiku/sonnet -p or api = quality-critical structured output (like compound tags), or Sonnet-shaped reasoning where DeepSeek's quirks bite.
opus = the hard stuff. Design, adversarial review, anything where being wrong is expensive. I have some in-app summaries (structured tl;drs of other tl;dr summaries, in my voice) where Opus beat the other models in repeated testing.

A test to consider as you think about offloading to local or more affordable models: is the task STRUCTURED (output shape is fixed) and NARROW (one job, not multi-step)? If yes, local can work well. If it needs to reason across a long prompt or chain steps, consider DeepSeek. It can earn its keep there.

One more aspect that's useful to think about: FREQUENCY. High-frequency cheap tasks should fight to go local even if they're slightly degraded, because you can impose a quality control check and rerun them to get better output. Low-frequency tasks don't justify the local complexity. DeepSeek or Haiku make sense there.
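
Put together, the structured / narrow / frequency test reads roughly like this. It's a sketch of the heuristic, not a router I actually run. The Task fields, the tier labels, and the order of the checks are my own assumptions about how to encode it:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    structured: bool                # output shape is fixed (JSON, tags, scores)
    narrow: bool                    # one job, not multi-step
    high_frequency: bool            # runs often enough to justify local setup
    quality_critical: bool = False  # e.g. compound tags on writes
    hard: bool = False              # design, adversarial review, costly to get wrong


def pick_tier(task):
    """Map a task to a model tier using the heuristics described above."""
    if task.hard:
        return "opus"
    if task.quality_critical:
        return "haiku/sonnet"
    if task.structured and task.narrow:
        # Low-frequency tasks don't justify the local complexity.
        return "local" if task.high_frequency else "deepseek-or-haiku"
    # Long-prompt or multi-step work that isn't quality-critical.
    return "deepseek"


print(pick_tier(Task(structured=True, narrow=True, high_frequency=True)))
```

The ordering matters: the quality and difficulty checks come first so a frequent task that is also quality-critical doesn't get pushed local just because it's cheap to run.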