What I’ve found worth trying is not trying to swap local in for Sonnet wholesale. I pick narrow, frequent, lower-stakes pipeline steps and route those locally. Keep Sonnet/Haiku for quality-critical work — and pass the output of the local model into Sonnet/Haiku as needed.
My Office M1 Max (64GB) runs as a dedicated inference host over Tailscale.
Here’s where I have local models/what they're touching:
Two things that made local models through Ollama actually work for me:
OLLAMA_KEEP_ALIVE=-1 + OLLAMA_HOST=0.0.0.0 set persistently via LaunchAgent. Without keep-alive every cold call was ~14s.
Fallback to Haiku if office unreachable. Don't make a local-model dep a hard dep.
My mental model is that local models are great Haiku-replacements for narrow tasks where you control the prompt and the output is structured (JSON, tags, scores). They get squishy fast when you give them Sonnet-shaped work — Qwen3.6-35B outputs !!!!! garbage as a failure mode under prompts the model can't hold.
Another thing I ran into: Benchmarking haiku-vs-gemma4 (or whatever model) on concrete steps in the workflow or a process gives a number to decide against. 6 wins, 4 ties, 0 losses is a better thing to decide on than vibes.
I think of locally-hosted as an augmentation, not a replacement for the platform models. I’m getting to a point where I start with the workflow and then incrementally break steps down to the more affordable or locally hosted models as I get a sense of which parts can be offloaded.
tl;drs of other tl;dr summaries in my voice) where opus won in repeated testing and beat out the other models.A test to consider as you think on offloading to local/more affordable models: is the task STRUCTURED (output shape is fixed) and NARROW (one job, not multi-step)? If yes, local can work well. If it needs to reason across a long prompt or chain steps, then consider DeepSeek. It can earn its keep there.
One more aspect that's useful to think on: FREQUENCY. High-frequency cheap tasks should fight to go local even if they're slightly degraded because you can impose a quality control check and rerun them to generate improved output. Low-frequency tasks don't justify the local complexity. Deepseek or Haiku make sense there.