Claude Code Routes to 3 Local Backends, Eliminating Token Fees for Sessions That Use 10-50x More Tokens Than Chat

3 articles · Updated · KDnuggets · Jun 12

A new guide shows Claude Code can be redirected from Anthropic’s API to Ollama, LM Studio, or llama.cpp by setting ANTHROPIC_BASE_URL and mapping Sonnet, Haiku, and Opus requests to local model names.
The setup targets agentic coding sessions that consume 10-50x more tokens than plain chat, aiming to eliminate per-token fees, rate limits, and external data exposure while keeping most coding tasks local.
Ollama is positioned as the easiest path because it added native Anthropic Messages API support in January 2026; LM Studio added a compatible /v1/messages endpoint in version 0.4.1, while llama.cpp already supported the format.
The guide recommends at least 16 GB of RAM, with 32 GB preferred, and highlights glm-4.7-flash as a starting model, with larger options such as qwen3-coder and devstral-small-2 for stronger coding performance.
A key fix is setting CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1 to avoid errors triggered by anthropic-beta headers. With that in place, the guide argues, local inference is now practical without translation proxies.
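
The configuration described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the guide's own script: ANTHROPIC_BASE_URL and CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS come from the summary, while the three ANTHROPIC_DEFAULT_*_MODEL variable names, the helper names build_local_env and launch_claude, and the localhost ports (each server's usual default) are assumptions that should be checked against the Claude Code and backend documentation for your installed versions.

```python
import os
import subprocess

# Assumed default addresses for each local server; change these if your
# server listens on a different host or port.
BACKEND_URLS = {
    "ollama": "http://localhost:11434",
    "lmstudio": "http://localhost:1234",
    "llamacpp": "http://localhost:8080",
}

# Assumed names of the variables that map the Sonnet, Haiku, and Opus
# tiers to a model name; the summary does not spell them out.
TIER_VARS = (
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
)


def build_local_env(backend, model, base_env=None):
    """Return a copy of the environment that points Claude Code at a local backend."""
    env = dict(os.environ if base_env is None else base_env)
    # Redirect requests from Anthropic's API to the local server.
    env["ANTHROPIC_BASE_URL"] = BACKEND_URLS[backend]
    # Send every model tier to the same local model name.
    for var in TIER_VARS:
        env[var] = model
    # Avoid the anthropic-beta header errors mentioned in the guide.
    env["CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS"] = "1"
    return env


def launch_claude(backend="ollama", model="glm-4.7-flash"):
    """Hypothetical helper: start the claude CLI with the local settings applied."""
    return subprocess.call(["claude"], env=build_local_env(backend, model))
```

Calling launch_claude() would start a session against Ollama with glm-4.7-flash, and launch_claude("lmstudio", "qwen3-coder") would switch backend and model. Building the environment in a separate function keeps the settings inspectable before anything is launched, and leaves the parent shell's variables untouched.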