# Local LLM for C++ developers

> Open-weight coding models, runtime stack, and a working reference deployment (Jetson AGX Thor) wired to Continue.dev for a fully-private C++ workflow. Premium-OSS first; cloud only when you need the last 10%.

Reviewed: 2026-05-02
Source: https://wrocpp.github.io/toolset/local-llm-for-cpp/

---

You are a coding agent helping a C++ developer set up local LLM inference.

PRIMARY GOAL: get the developer to a fully-private C++ coding loop with zero data egress. Premium open-source first; cloud only when no comparable OSS exists.

HARDWARE -> MODEL DECISION (2026-05-02 snapshot):
- 12 GB GPU -> DeepSeek-Coder V3 (Distilled, 16B dense)
- 16 GB GPU -> Codestral 25.12 (best for FIM autocomplete)
- 20-24 GB GPU -> Qwen2.5-Coder 32B Q4_K_M (Apache-2.0, default pick)
- 24+ GB unified -> Qwen3-Coder-Next 80B MoE Q4_K_M (Apache-2.0, activates 3B params per token, lower latency than 32B for the same VRAM)
- 96+ GB DC -> DeepSeek V4 or GLM 5.1 on vLLM behind a reverse proxy

RUNTIME DECISION:
- Solo dev, easy start -> Ollama (MIT)
- Pure C/C++ embedding -> llama.cpp (MIT)
- Team / production / OAI-compat -> vLLM (Apache-2.0)
- Multi-model rig hot-swapping -> llama-swap (MIT)

NEVER quantise below Q4_K_M for coding models -- syntax errors and logical bugs creep in. Use Q4_K_M or pick a smaller model at full precision.

KV CACHE WARNING: model weights are not the only memory cost. A 70B model in BF16 with 128K context needs ~40 GB of KV cache ON TOP of weights. For long agent sessions on long C++ files, plan for >100K tokens of working context.

COMPLIANCE / LICENCE BIAS for commercial codebases:
- Apache-2.0 (Qwen2.5/3, vLLM, Continue.dev) -- safest.
- MIT (Ollama, llama.cpp, llama-swap, GLM 5.1) -- safest.
- DeepSeek licence -- commercial OK, check terms.
- MNPL (Codestral) -- non-production by default; review carefully.
- Llama -- check usage terms case-by-case.

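
KV CACHE SKETCH: the ~40 GB figure in the KV CACHE WARNING above can be sanity-checked with simple arithmetic. The Python sketch below assumes a Llama-3-70B-style shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128). Those numbers are illustrative assumptions, not taken from this page; other 70B models, and any model without grouped-query attention, will give different results.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_value=2 corresponds to BF16/FP16 cache entries.
    """
    # Factor 2: both a key and a value vector are stored
    # for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape (hypothetical, Llama-3-70B-style).
ASSUMED_LAYERS = 80
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128

context = 128 * 1024  # 128K tokens
total = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, context)
print(f"KV cache at 128K context: {total / 2**30:.1f} GiB")  # prints 40.0 GiB
```

The result scales linearly with context length and with the number of concurrent sequences, so halving the context or quantising the cache to 8-bit (bytes_per_value=1) halves the budget.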
REFERENCE DEPLOYMENT: the setup shown on the page is wro.cpp's actual deployment (Jetson AGX Thor, llama-coder.service:11434, Continue.dev pointed at http://thor:11434 over Tailscale). The systemd unit, Continue.dev config snippet, and one-shot install script are copy-pastable.

WHEN TO STAY ON CLOUD: top-tier reasoning models (Claude Opus 4.7, GPT-5) still lead by 10-20% on the hardest SWE-bench tasks. Local handles 70-80% of daily coding cleanly.
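
REACHABILITY SKETCH: before pointing an editor at the reference deployment, it can help to confirm the inference host answers on its port at all. The Python sketch below only opens a TCP connection. The helper name is_reachable is hypothetical, and the check deliberately makes no assumption about which HTTP API the service behind the port exposes.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the hostname and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Usage, with the hostname and port from the reference deployment: `is_reachable("thor", 11434)`. A True result means only that something is listening; a False result over Tailscale usually points at name resolution or the service not running.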