Claude Code is one of the best agentic coding tools out there — it lives in your terminal, understands your codebase, and handles complex multi-file tasks with natural language. The catch? Running it on Anthropic’s official API can get expensive fast. If you’re on it daily, you’re looking at $100–$200/month on the MAX plan.
Here’s the good news: Claude Code is, at its core, just a client that speaks the Anthropic Messages API format. It doesn’t verify that there’s an actual Claude model on the other end. That means you can point it at any local inference server that speaks the same API format — and in 2026, there are several excellent options for doing exactly that.
This guide covers everything you need: how to get local models running, which models are best for code generation, and how to configure Claude Code to use them.
Why Run Local Models with Claude Code?
- Cost: Local models are free to run after the initial hardware investment. Third-party alternatives can save you up to 98% compared to Anthropic’s flagship API pricing.
- Privacy: Your code never leaves your machine. Ideal for sensitive codebases, proprietary projects, or anything you’d rather not send to a third-party server.
- No rate limits: Run as many prompts as you like without throttling.
- Offline capability: Once the model is downloaded, you don’t need an internet connection.
The trade-off is a small dip in raw capability compared to Claude Sonnet or Opus — but for most day-to-day development tasks, a well-chosen local model is more than sufficient.
Option 1: Ollama (Easiest)
Ollama is the simplest way to get a local LLM running. It handles model downloads, quantisation, and serving in a single tool, and now includes native support for the Anthropic Messages API — which is exactly what Claude Code needs.
Install Ollama
On macOS or Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the installer from ollama.com.
Pull a Model
# Good starting point (16GB RAM)
ollama pull qwen2.5-coder:14b
# Lighter option (8GB RAM)
ollama pull qwen2.5-coder:7b
# High-end option (32GB RAM)
ollama pull devstral-small-2
Important: Claude Code is context-heavy. Set the context length to at least 32k tokens in your Ollama settings — 64k is recommended if your hardware can handle it.
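Two ways to raise the context length, sketched below: recent Ollama releases read an OLLAMA_CONTEXT_LENGTH environment variable, and any version lets you bake num_ctx into a model variant via a Modelfile (the variant name qwen2.5-coder-64k here is just an example):

```shell
# Option A: set the server's default context length
# (environment variable supported by recent Ollama releases).
export OLLAMA_CONTEXT_LENGTH=65536

# Option B: bake a larger context into a named model variant.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 65536
EOF

# Creating the variant needs a working Ollama install, so guard it:
if command -v ollama >/dev/null 2>&1; then
  ollama create qwen2.5-coder-64k -f Modelfile
fi
```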
Configure Claude Code to use Ollama
Set these environment variables before launching Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
Then launch Claude Code with your chosen model:
claude --model qwen2.5-coder:14b
To make these variables persist across terminal sessions, add the export lines to your ~/.zshrc or ~/.bashrc.
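If you'd rather not export the overrides globally, a small wrapper function keeps them scoped to a single invocation. This is a hypothetical helper (the name claude-local and the default model are placeholders) you could drop in ~/.zshrc or ~/.bashrc:

```shell
# Launch Claude Code against local Ollama without polluting the rest
# of your shell environment (hypothetical helper; pass a model name
# as the first argument, or fall back to qwen2.5-coder:14b).
claude-local() {
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_AUTH_TOKEN=ollama \
  claude --model "${1:-qwen2.5-coder:14b}"
}
```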
Ollama also has :cloud variants that run on cloud infrastructure using the same commands — no API keys needed. Useful if you want to try a larger model without local hardware:
ollama pull kimi-k2.5:cloud
claude --model kimi-k2.5:cloud
Option 2: LM Studio (Best GUI Experience)
LM Studio gives you a graphical interface for browsing, downloading, and running models. It’s particularly good if you prefer not to work entirely in the terminal. Since version 0.4.1, it includes an Anthropic-compatible /v1/messages endpoint.
Install LM Studio
Download from lmstudio.ai. On a server or VM, you can use the CLI installer:
curl -fsSL https://lmstudio.ai/install.sh | bash
Start the server
lms server start --port 1234
Then set your environment variables:
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
And launch Claude Code:
claude --model openai/your-model-name
LM Studio recommends starting with a context size of at least 25K tokens and increasing it for better results.
Option 3: llama.cpp (Most Control)
For the most control over inference settings — quantisation, KV cache type, batch size, and so on — llama.cpp is the way to go. It has native support for the Anthropic Messages API, so no proxy or translation layer is needed.
Install llama.cpp
On Apple Silicon Mac:
brew install llama.cpp
On Ubuntu/Linux with CUDA, build from source for best performance.
Download a GGUF model
MODEL_PATH="$(uvx hf download unsloth/GLM-4.7-Flash-GGUF GLM-4.7-Flash-UD-Q4_K_XL.gguf)"
Start the llama server
llama-server \
  --model "$MODEL_PATH" \
  --alias "my-model" \
  --temp 1.0 \
  --top-p 0.95 \
  --port 8001 \
  --ctx-size 131072 \
  --flash-attn on
Point Claude Code at it
export ANTHROPIC_BASE_URL=http://localhost:8001
claude --model my-model
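Before launching Claude Code, it's worth smoke-testing the server directly. Here's a minimal one-shot request in the Anthropic Messages format (a sketch assuming the server is on port 8001 as above; the payload fields follow the Messages API shape):

```shell
# One-shot request against llama-server's Anthropic-compatible endpoint.
PAYLOAD='{"model":"my-model","max_tokens":32,"messages":[{"role":"user","content":"Reply with OK"}]}'
curl -s http://localhost:8001/v1/messages \
  -H 'content-type: application/json' \
  -d "$PAYLOAD" || echo "llama-server is not running on port 8001"
```

If you get JSON back instead of the fallback message, Claude Code will work against the same endpoint.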
Fix the KV Cache Performance Issue
This is the gotcha that catches most people. Claude Code adds an attribution header to every request, which invalidates the KV cache on local models — making inference roughly 90% slower. The fix is a one-line change to your Claude settings file.
Edit (or create) ~/.claude/settings.json and add:
{
"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
}
}
Note: Using export CLAUDE_CODE_ATTRIBUTION_HEADER=0 in the terminal does NOT work — it must be in the settings file.
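If you already have other settings in that file, a hand edit risks clobbering them. This sketch merges the fix in idempotently using python3 (the file path is Claude Code's default settings location):

```shell
# Add the attribution-header fix without overwriting existing keys.
SETTINGS="$HOME/.claude/settings.json"
mkdir -p "$(dirname "$SETTINGS")"
[ -f "$SETTINGS" ] || echo '{}' > "$SETTINGS"
python3 - "$SETTINGS" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
# Merge into any existing "env" block rather than replacing it.
cfg.setdefault("env", {})["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```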
Best Local Models for Code Generation
The local model landscape has matured rapidly. Here’s what’s worth running in 2026, broken down by hardware.
8GB RAM — Tight Budget
- Qwen2.5-Coder 7B — The best coding model at this size tier. Scores around 76% on HumanEval, outperforming general-purpose models twice its size on code tasks.
ollama pull qwen2.5-coder:7b
- Phi-4 Mini (3.8B) — Exceptional reasoning performance per parameter. Good fallback if you’re very RAM-constrained.
16–24GB RAM — The Sweet Spot
- Qwen2.5-Coder 14B — The top-rated local coding model for this tier. Noticeably better at multi-file reasoning and complex algorithm implementation. HumanEval score around 85%.
ollama pull qwen2.5-coder:14b
- GLM-4.7-Flash — Strong value/latency tradeoff, particularly good quantised. Works well on a 24GB device.
- Devstral Small 2 (24B) — Mistral’s coding-focused model, scores 68% on SWE-bench Verified. Runs on a single RTX 4090 or a Mac with 32GB RAM. Apache 2.0 licensed.
32GB+ RAM — High Performance
- Qwen3-Coder (30B MoE variant) — Alibaba’s agentic coding model, trained specifically for multi-step coding workflows with tool calling and file editing. Handles 256K+ context. The current community favourite for serious local agentic work.
- Qwen3 Coder Next (80B MoE, 3B active) — Scores 70.6% on SWE-bench Verified with only 3B active parameters. Exceptional at tool use and following MCP documentation. Currently one of the best local models for use with Claude Code specifically.
- Devstral 2 (123B) — Mistral’s flagship coding model. 72.2% on SWE-bench Verified, 256K context. Requires serious hardware but delivers near-frontier results.
Quick Hardware Reference
| RAM | Recommended Model | Pull Command |
|---|---|---|
| 8GB | Qwen2.5-Coder 7B | ollama pull qwen2.5-coder:7b |
| 16GB | Qwen2.5-Coder 14B | ollama pull qwen2.5-coder:14b |
| 24GB | Devstral Small 2 | ollama pull devstral-small-2 |
| 32GB | Qwen3-Coder 30B | ollama pull qwen3-coder:30b |
| 64GB+ | Qwen3 Coder Next | ollama pull qwen3-coder-next |
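The table above condenses to a simple lookup. Here's a hypothetical helper function (the name suggest_model and the tier cut-offs are mine, mirroring the table) if you want to script the choice:

```shell
# Suggest a model tag from available RAM in GB, per the table above.
suggest_model() {
  ram_gb=$1
  if   [ "$ram_gb" -ge 64 ]; then echo "qwen3-coder-next"
  elif [ "$ram_gb" -ge 32 ]; then echo "qwen3-coder:30b"
  elif [ "$ram_gb" -ge 24 ]; then echo "devstral-small-2"
  elif [ "$ram_gb" -ge 16 ]; then echo "qwen2.5-coder:14b"
  else                            echo "qwen2.5-coder:7b"
  fi
}

suggest_model 16   # prints qwen2.5-coder:14b
```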
Configuring Claude Code: settings.json
Beyond the attribution header fix, you can configure Claude Code’s behaviour persistently via ~/.claude/settings.json. Here’s a useful starting config for local model use:
{
"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
"ANTHROPIC_BASE_URL": "https://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama"
}
}
With this in place, you don’t need to set environment variables every time you open a terminal — Claude Code will pick them up automatically.
Model-specific tips
- Context window: Local models need a large context window to work well with Claude Code. Set it to at least 32K tokens; 64K or more is better. In Ollama, configure this in the model settings widget.
- If you get a ConnectionRefused error: Make sure your inference server (Ollama/LM Studio/llama-server) is actually running before launching Claude Code. Run ollama list or open http://localhost:11434 in your browser.
- To switch back to the Anthropic API: Run unset ANTHROPIC_BASE_URL in your terminal, or remove the env entries from settings.json.
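For the terminal route, the full switch-back is just a couple of lines (a sketch; substitute your actual key):

```shell
# Return this shell to Anthropic's hosted API.
unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN
export ANTHROPIC_API_KEY="sk-ant-your-key-here"  # replace with your real key
```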
Putting It All Together
The simplest end-to-end setup in 2026 looks like this:
- Install Ollama
- Pull Qwen2.5-Coder 14B (or whatever fits your hardware)
- Add the attribution header fix and connection settings to ~/.claude/settings.json
- Set context length to 32K+ in Ollama settings
- Run claude --model qwen2.5-coder:14b inside your project folder
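Before the first run, a quick checklist script can confirm each step landed. A sketch, assuming the Ollama defaults used throughout this guide (port 11434, settings at ~/.claude/settings.json):

```shell
# Sanity checks for the setup above (prints one line per check).
command -v ollama >/dev/null 2>&1 \
  && echo "ok: ollama installed" || echo "missing: ollama"
curl -s http://localhost:11434/api/tags >/dev/null \
  && echo "ok: ollama server is up" || echo "missing: ollama server not running"
grep -q CLAUDE_CODE_ATTRIBUTION_HEADER "$HOME/.claude/settings.json" 2>/dev/null \
  && echo "ok: attribution header disabled" || echo "missing: settings.json fix"
```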
That’s genuinely it. What used to require fragile adapters and hacks is now a five-step process. The capability gap between local and cloud models has narrowed considerably — for the kinds of tasks you’d use Claude Code for daily (code completion, refactoring, debugging, explaining existing codebases), a good local model covers the vast majority of use cases.
If you try this out, let me know in the comments which model and setup you land on — it’s an area that’s moving fast and the community recommendations keep evolving.
