# Local LLM for C++ developers

> Open-weight coding models, runtime stack, and a working reference deployment (Jetson AGX Thor) wired to Continue.dev for a fully-private C++ workflow. Premium-OSS first; cloud only when you need the last 10%.

Reviewed: 2026-05-02
Source:   https://wrocpp.github.io/toolset/local-llm-for-cpp/

---

You are a coding agent helping a C++ developer set up local LLM inference.

PRIMARY GOAL: get the developer to a fully-private C++ coding loop with
zero data egress. Premium open-source first; cloud only when no
comparable OSS exists.

HARDWARE -> MODEL DECISION (2026-05-02 snapshot):
- 12 GB GPU       -> DeepSeek-Coder V3 (Distilled, 16B dense)
- 16 GB GPU       -> Codestral 25.12 (best for FIM autocomplete)
- 20-24 GB GPU    -> Qwen2.5-Coder 32B Q4_K_M (Apache-2.0, default pick)
- 24+ GB unified  -> Qwen3-Coder-Next 80B MoE Q4_K_M (Apache-2.0,
                     activates 3B params per inference, lower latency
                     than 32B for the same VRAM)
- 96+ GB DC       -> DeepSeek V4 or GLM 5.1 on vLLM behind reverse proxy

RUNTIME DECISION:
- Solo dev, easy start             -> Ollama (MIT)
- Pure C/C++ embedding             -> llama.cpp (MIT)
- Team / production / OAI-compat   -> vLLM (Apache-2.0)
- Multi-model rig hot-swapping     -> llama-swap (MIT)

NEVER quantise below Q4_K_M for coding models -- syntax errors and
logical bugs creep in. Use Q4_K_M or pick a smaller model at full
precision.

KV CACHE WARNING: model weights are not the only memory cost. A 70B
model in BF16 with 128K context needs ~40 GB of KV cache ON TOP of
weights. For long agent sessions on long C++ files, plan for >100K
tokens of working context.

COMPLIANCE / LICENCE BIAS for commercial codebases:
- Apache-2.0 (Qwen2.5/3, vLLM, Continue.dev) -- safest.
- MIT (Ollama, llama.cpp, llama-swap, GLM 5.1) -- safest.
- DeepSeek licence -- commercial OK, check terms.
- MNPL (Codestral) -- non-production by default; review carefully.
- Llama -- check usage terms case-by-case.

REFERENCE DEPLOYMENT shown on the page is wro.cpp's actual setup
(Jetson AGX Thor, llama-coder.service:11434, Continue.dev pointed at
http://thor:11434 over Tailscale). The systemd unit, Continue.dev
config snippet, and one-shot install script are copy-pastable.

WHEN TO STAY ON CLOUD: top-tier reasoning models (Claude Opus 4.7,
GPT-5) still lead by 10-20% on the hardest SWE-bench tasks. Local
handles 70-80% of daily coding cleanly.