Updated · magazine.sebastianraschka.com · Jun 27
Tutorial Shows 35B Local Coding Agents Reaching 20-40 Tok/Sec With Open-Source Harnesses

3 articles

Summary

  • A new tutorial walks readers through building a fully local coding-agent stack that pairs Ollama with agent harnesses such as the open-source Qwen-Code and Codex, plus Claude Code.
  • Qwen3.6 35B-A3B is the main model tested: it is a roughly 22 GB download, needs about 30-40 GB of RAM, and delivered about 40 tok/sec on a recent Mac mini and about 30 tok/sec on a DGX at a 50k-token context.
  • Small capability checks found that Qwen3.6 and Cohere's North Mini Code solved 4 of 5 benchmark tasks in Qwen-Code, that Codex sometimes outperformed Qwen's native harness, and that Claude Code used far more tokens.
  • The guide also urges a security audit before use. It warns that coding agents can run shell commands and that Qwen-Code may send telemetry, and it recommends disabling usage statistics, prompt logging, auto-updates and hooks in Qwen-Code's settings.
  • The broader takeaway is that 30-35B open-weight models are now viable for many local coding workflows, offering lower privacy risk and fixed costs to users willing to accept the setup complexity and hardware demands.
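Tok/sec figures like those above can be reproduced on your own hardware. What follows is a minimal sketch, not the tutorial's own benchmarking method. It assumes a local Ollama server on its default port 11434. Ollama's non-streaming `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), and the `num_ctx` option sets the context window. The model tag `qwen3.6:35b-a3b` is a hypothetical placeholder, and `50_000` is an assumed stand-in for "50k context".

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes the server is running on this machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, num_ctx: int = 50_000) -> float:
    """Send one non-streaming request and return generated tokens per second."""
    payload = json.dumps({
        "model": model,  # e.g. a hypothetical tag such as "qwen3.6:35b-a3b"
        "prompt": prompt,
        "stream": False,  # return one JSON object that includes the timing fields
        "options": {"num_ctx": num_ctx},  # context window size in tokens (assumed 50k)
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))
```

With a model pulled locally, `benchmark("qwen3.6:35b-a3b", "Write a Python function that reverses a string.")` returns a single throughput number. Check `ollama list` for the actual tag on your machine. A single short prompt is a noisy measurement, so averaging several runs gives a fairer comparison across machines.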

Insights

With open-source AI now rivaling cloud services, what is the true cost of running your own coding agent?
As powerful AI agents access local files, what new security vulnerabilities are we unknowingly creating?
If the agent harness matters more than the LLM, are we focusing on the wrong part of the AI stack?