For C++ developers in 2026, local LLM inference is no longer a curiosity. Open-weight coding models close the gap on cloud APIs for the bulk of daily work, and proprietary / regulated codebases (defence, embedded, finance, gaming engines) make it a hard requirement. This page walks through the hardware-to-model-to-runtime stack, then shows a working reference deployment wro.cpp runs in production.
Why local
Four reasons, in order of how often we hear them.
- Data sovereignty. Every prompt sent to a cloud API leaves the developer’s machine and passes through third-party infrastructure. For proprietary engines, customer code, or regulated industries (HIPAA, GDPR, ITAR, financial services), that is a non-starter. Local inference keeps everything on-device or on-prem.
- Zero marginal cost. Once the hardware is paid for, the compute is “free” relative to per-token API billing. For a heavy agent user (~$60-200/mo on Cursor or Claude.ai), break-even on a 24 GB GPU is roughly 12-18 months.
- No rate limits. Cloud agents throttle. Local agents do not. For batch refactors across hundreds of files, this is the difference between “ship today” and “ship Monday”.
- Latency stability. Local inference is not on the wrong side of someone else’s incident. The model does not get rate-limited at 09:00 UTC because every developer in EMEA woke up at the same time.
The honest counter: top-tier cloud reasoning models (Claude Opus 4.7, GPT-5) still lead local by roughly 10-20% on the hardest SWE-bench Verified tasks. Local handles the 70-80% of daily coding cleanly; reach for cloud on the gnarly remaining work.
Hardware reality (the part nobody tells you upfront)
Model weights are not the only memory cost. A coding agent on long C++ files chews through context fast (5-10 headers + .cpp + sanitizer output + build log easily exceeds 50K tokens), and KV cache scales linearly with context.
For a Llama-class 70B model in BF16, KV cache is roughly 0.31 MB per token. A 128K-token context takes ~40 GB of KV cache ON TOP of the ~40 GB of weights. So that “70B fits on a 32 GB card” claim from the marketing slide assumes a 4K context window, which is useless for a coding agent.
Practical implication: the headroom you want is roughly 1.5x the model size for short contexts and 2-3x for long-context coding work.
(Note: Google’s TurboQuant (ICLR 2026) compresses KV cache to 3 bits with 6x reduction and up to 8x speedup, but it is not yet implemented in mainline llama.cpp or Ollama. Watch this space.)
Model picks by hardware tier
| Tier | Pick | Score | Licence | Notes |
|---|---|---|---|---|
| 12 GB | DeepSeek-Coder V3 (Distilled, 16B dense) | 40.5% SWE-bench Verified | DeepSeek licence (commercial use OK with attribution) | Highest reasoning per GB. Runs on RTX 4070 Ti / 4080 Mobile / 12 GB laptop GPUs. Q4_K_M only -- never below. |
| 16 GB | Codestral 25.12 | FIM-optimised; fast inline completions | MNPL (Mistral Non-Production License) -- check terms for commercial use | Best for IDE autocomplete loops in VS Code. Pair with Continue.dev FIM mode. |
| 20-24 GB | Qwen2.5-Coder 32B (Q4_K_M) | 87% HumanEval, 79% MBPP, 40+ languages, 128K context, FIM | Apache-2.0 | Strongest single-GPU all-rounder. RTX 3090 / 4090 / 5090 class. Recommended default for most C++ shops. |
| 24+ GB / unified memory | Qwen3-Coder-Next 80B MoE (Q4_K_M, ~48 GB) | Leading open-weight on agentic SWE-bench | Apache-2.0 | wro.cpp reference: Jetson AGX Thor (126 GB unified LPDDR5X) running llama-coder.service on port 11434. Activates only 3B params per inference (MoE), so latency is closer to the 32B all-rounder. |
| 96+ GB / data centre | DeepSeek V4 / GLM 5.1 | Frontier open-weight; closes the gap on cloud Tier-A models | DeepSeek licence / MIT (GLM 5.1) | vLLM behind a reverse proxy. Team-scale serving (continuous batching, OpenAI-compatible endpoint). RTX PRO 6000 (96 GB) or A100/H100. |
Quantisation guardrail. Never use Q3_K_S or below for coding models. The syntax errors and logical bugs that creep in are not worth the RAM saved. Either use Q4_K_M (sweet spot) or pick a smaller model at full precision.
FIM (fill-in-the-middle). For inline IDE autocomplete you need an FIM-capable model. Qwen2.5-Coder, Qwen3-Coder-Next, DeepSeek-Coder, and Starcoder2 all support FIM. Llama 3.1 8B general does not.
Runtime stack
Ollama MIT
Best for: Solo developer, fastest start (`ollama run qwen2.5-coder:32b`), zero config.
Watch out: Lifecycle bugs in long agent sessions (mid-session unloads, context drift, broken bf16). Migrate to llama-swap if you hit them.
llama.cpp MIT
Best for: Pure-C/C++ runtime, no Python. Embeddable. Apple Metal / NVIDIA CUDA / AMD ROCm / AVX2/512. Raspberry Pi to H100 with one binary.
Watch out: Per-model flag tuning (e.g. Qwen 3.x reasoning models need `--reasoning-format none`).
vLLM Apache-2.0
Best for: Team deployment. Continuous batching, OpenAI-compatible endpoint, production throughput.
Watch out: Heavier setup; expects dedicated GPU server hardware (A100 / H100).
llama-swap MIT
Best for: Multi-model lifecycle wrapper around llama-server. Hot-swap between coder + chat + embedding models on demand.
Watch out: Each model needs specific flags wired into the config. Worth the effort for multi-model rigs.
Reference deployment (the wro.cpp loop)
This is the actual setup we use day to day. Hardware, service file, client config — everything copy-pastable.
Hardware: Jetson AGX Thor (126 GB unified LPDDR5X, NVIDIA Thor compute 11.0)
Model: Qwen3-Coder-Next 80B MoE Q4_K_M (~48 GB on disk, ~52 GB in use)
Server: llama.cpp build with CUDA backend
Service: /etc/systemd/system/llama-coder.service (port 11434)
Client: Continue.dev plugin in VS Code, pointed at http://thor:11434
Network: Tailscale (Thor reachable at 100.112.201.66 from anywhere)
/etc/systemd/system/llama-coder.service
[Unit]
Description=llama.cpp coder model server (Qwen3-Coder-Next 80B)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=llama
Group=llama
WorkingDirectory=/opt/llama-models
ExecStart=/opt/llama.cpp/build/bin/llama-server \
--model /opt/llama-models/qwen3-coder-next-80b-Q4_K_M.gguf \
--host 0.0.0.0 --port 11434 \
--n-gpu-layers 999 \
--ctx-size 65536 \
--parallel 2 \
--cont-batching \
--threads 16 \
--reasoning-format none
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Activate:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-coder.service
curl http://thor:11434/v1/models
Continue.dev ~/.continue/config.json snippet
{
"models": [
{
"title": "Qwen3-Coder-Next (Thor)",
"provider": "openai",
"model": "qwen3-coder-next",
"apiBase": "http://thor:11434/v1",
"apiKey": "no-auth-tailscale-only",
"completionOptions": {
"temperature": 0.2,
"topP": 0.95,
"maxTokens": 4096
}
}
],
"tabAutocompleteModel": {
"title": "Qwen FIM (Thor)",
"provider": "openai",
"model": "qwen3-coder-next",
"apiBase": "http://thor:11434/v1",
"template": "qwen2.5-coder"
},
"systemMessage": "You are a C++26 expert. Prefer concepts and constexpr. RAII for every owned resource. No exceptions in hot paths -- use std::expected. Match the project's existing style; check AGENTS.md before suggesting deviations."
}
One-shot bring-up (from a fresh Ubuntu 24.04 box)
For a quick start without the full systemd setup:
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull and run the 32B coder (use Qwen3-Coder-Next on bigger boxes)
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b "// write a C++26 reflection-based JSON serializer"
# 3. Install Continue.dev in VS Code
code --install-extension Continue.continue
# 4. Point Continue at Ollama (in ~/.continue/config.json):
# apiBase: http://localhost:11434/v1
You now have a fully-local C++ coding loop. Total setup time: 30 minutes from a clean box, model download dominated.
Privacy posture
Three concrete things to harden once you are running locally.
- Audit runtime logging. Both Ollama and llama-server log requests by default in dev mode. For sensitive code, set the runtime to log only metrics (latency, token counts) and never the prompt body. Set restrictive permissions on log directories (
chmod 700). - Treat GGUF files as supply-chain risk. A GGUF model file is a binary blob that your inference engine loads directly into memory. Pin to specific SHAs from official Hugging Face repositories of the model authors; do not pull from random mirrors.
- Prefer permissive licences for commercial work. Apache-2.0 (Qwen, Continue.dev, vLLM) and MIT (llama.cpp, Ollama, GLM 5.1) are the safest. Llama-licensed models, Codestral (MNPL), and DeepSeek-licensed models all need a careful licence read for commercial deployment.
Cost transparency
For the wro.cpp Thor deployment:
- Hardware capex: Jetson AGX Thor ~$3499 (Aug 2025 launch list; check current pricing). Could be a refurbished workstation + RTX 3090 24 GB at ~$1500.
- Marginal compute: $0 per request once running. Power draw ~30-60W idle, ~150W active.
- Cloud equivalent: Same throughput tier on Claude.ai Ultra ~$200/mo; on a vLLM-hosted Qwen2.5-Coder-32B endpoint ~$300-800/mo depending on RPM.
- Break-even: ~6-18 months versus a heavy cloud subscription, faster if you are paying for multiple seats.
When to keep using cloud
Honesty section. Cloud is still ahead in two places:
- Hardest reasoning tasks. Claude Opus 4.7 / GPT-5 still lead open-weight by roughly 10-20% on the upper tail of SWE-bench Verified. For genuinely hard multi-file C++ refactors with subtle template / lifetime issues, cloud is worth the egress.
- No hardware budget. A new $20/mo Cursor subscription gets you most of the value with no capex. The local-first argument is strongest when (a) you have a regulated codebase, (b) you have multiple developers (the per-seat math compounds), or (c) you already own GPU hardware.
The pragmatic stack we recommend most often: Continue.dev + local Qwen2.5-Coder-32B for the daily 80%, Claude Code (cloud) for the hard 20%. You only pay for the cloud tier when you reach for it.