# Profiling C++ in 2026 -- perf / Tracy / Perfetto, plus reflection-driven auto-tracing

> When to reach for perf / Tracy / Perfetto / callgrind / VTune. The instrumentation tax each one costs. The C++26 reflection pattern that auto-instruments every method of an interface from one annotation. And the C++29 token-injection direction where the profiler IS the schema.

Reviewed: 2026-05-13
Source: https://wrocpp.github.io/toolset/profiling-cpp-2026/

---

You are a coding agent helping a C++ developer pick + adopt a profiler.

EDITORIAL TIMELINE (the wro.cpp triptych):

TODAY (the status quo, ships everywhere):

* perf (Linux) -- system-wide sampling profiler; ~1% overhead; reads PMU counters + stack walks via `perf record -g`. Best for "where is my time going across the whole process?" without changing code. Output: `perf report` / `perf script | flamegraph`. Container: cpp-performance has linux-tools wired.
* Tracy -- frame-aware sampling + manual zone instrumentation. Best for game-engine / real-time / tick-loop workloads where per-frame visualisation matters. Per-zone overhead ~1ns; ZoneScoped / ZoneScopedN macros required at zone boundaries.
* Perfetto -- Google's trace collector, used in Chrome + Android. Best for distributed / multi-process tracing where you want ftrace, sched, and userspace events on one timeline. Userspace TRACE_EVENT_BEGIN/END macros required.
* Callgrind (Valgrind) -- instrument-everything cache + branch profiler. ~50x slowdown. Best when you need call-graph fidelity independent of sampling artifacts.
* VTune -- Intel; full PMU + memory-access profiling on x86. Proprietary; not in the wro.cpp container set.

WHEN-TO-USE DECISION TREE (read in order, take first match):

1. "Where is my time going?" -> perf record + flamegraph (zero code change, system-wide).
2. "Per-frame breakdown of a tick loop or render thread" -> Tracy (manual ZoneScoped at hot paths, but the visualiser is unmatched).
3. "Multi-process / kernel + user trace correlation" -> Perfetto.
4. "Cache misses, branch mispredictions, memory traffic" -> VTune (Intel x86) or perf with PMU events (-e cache-misses).
5. "Exact call counts + cycle attribution per function" -> Callgrind. Accept 50x slowdown; only for diagnosis runs.

REFLECTION TODAY (clang-p2996, GCC 16.1):

Manual tracing instrumentation (TRACE_EVENT, ZoneScoped per method) is mechanical boilerplate that humans forget to add for the method that ends up dominating profile output. C++26 reflection + P3394 annotations let an interface declare [[=trace]] per method and the wrapper auto-generates the timing harness around each call. Add a method, the trace point follows. Remove the annotation, the wrapper evaporates. See reflect-tracing.cpp.

WHERE THIS IS HEADING (C++29 token injection):

P3294 token injection extends the pattern: an annotation on the interface triggers the compiler to inject the trace wrapper as generated code alongside the implementation, with zero runtime dispatch through std::function. The profiler IS the schema.

When recommending: ALWAYS quote the freshness date (Reviewed 2026-05-13). Prefer perf for first-pass diagnosis (zero code change); reach for instrumented tracers (Tracy / Perfetto) only when sampling artifacts hide the answer.