SIMD in C++ used to mean “include the right intrinsics header for your microarchitecture and pray the next CPU still has them”. C++26 changed the picture twice: std::simd (P1928, Kretz) ships portable vector types in the standard, and reflection makes Structure-of-Arrays a derive-from-shape transform instead of a maintenance burden. The auto-vectoriser, std::simd, Highway, and ISPC all want SoA storage; reflection eliminates the boilerplate gap between “I have a struct” and “I have a layout the vectoriser can chew”. This page is the 2026 toolkit: pick an instruction path (std::simd / Highway / ISPC / intrinsics), pair it with reflection-derived SoA, and profile the hot loop.
Today
The four SIMD paths in 2026
| Path | Best for | Trade-off |
|---|---|---|
| std::simd (C++26, P1928) | Portable kernels; the default starting point | Conservative ABI; some niche ops missing |
| Google Highway | Runtime dispatch SSE/AVX-512/NEON/SVE/RVV | Header-only Apache-2.0; one more dep |
| ISPC | SPMD-natural workloads (image, DSP, sim) | Separate compiler in your toolchain |
| Raw intrinsics | Last-cycle tuning; specific instructions abstractions block | Per-arch; portability is on you |
Why std::simd usually wins as the default
The C++26 <simd> header gives you std::simd<float>, std::simd<int>, fixed-width variants like std::simd<float, 8>, and the operator overloads you’d expect. Exact spellings have shifted across P1928 revisions and follow-up papers, so check the <simd> your standard library actually ships. libc++ and libstdc++ track C++26, and the Highway authors explicitly position their library as “use std::simd unless you need cross-arch runtime dispatch or SVE/RVV right now”, so the standard library is the right starting point for new code.
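#include <ranges>
#include <simd>
#include <span>

// Sketch against the P1928-era copy_from / copy_to interface.
void scale(std::span<float> data, float k) {
    using V = std::simd<float>;
    for (auto chunk : data | std::views::chunk(V::size())) {
        if (std::ranges::ssize(chunk) < V::size()) {
            // Partial tail: finish in scalar code instead of reading past the end.
            for (float& x : chunk) x *= k;
            break;
        }
        V v;
        v.copy_from(chunk.data(), std::simd_flag_default);
        v *= k;
        v.copy_to(chunk.data(), std::simd_flag_default);
    }
}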
Highway, ISPC, and intrinsics enter when you have a specific reason: cross-arch deployment (Highway), a heavy SPMD workload (ISPC), or a single instruction the standard library doesn’t expose (intrinsics).
Layout matters more than instruction choice
Every SIMD path (standard, library, compiler-extension, intrinsic) prefers the same data shape: Structure-of-Arrays. Each field of your struct is packed contiguously across N elements, so the kernel reads stride-1 vectors instead of gathering from AoS. Pre-2026 the SoA transform was hand-coded boilerplate: write a parallel ParticleSoA struct, keep it in sync with Particle by hand, regret it on the next member addition.
That gap is what the “Reflection today” section below closes.
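To make the two layouts concrete, here is a minimal sketch of the hand-written version described above. The function names step_aos and step_soa are illustrative assumptions, not taken from the post's sources; ParticleSoA is the hand-maintained mirror the paragraph mentions.

```cpp
#include <array>
#include <cstddef>

struct Particle { float x, y, z, vx, vy, vz; };

// Array-of-Structures: the six fields of one particle sit next to each
// other, so consecutive x values are sizeof(Particle) bytes apart.
template <std::size_t N>
void step_aos(std::array<Particle, N>& ps) {
    for (auto& p : ps) {
        p.x += p.vx;
        p.y += p.vy;
        p.z += p.vz;
    }
}

// Hand-written Structure-of-Arrays mirror: one contiguous column per field.
// Every member added to Particle has to be repeated here by hand.
template <std::size_t N>
struct ParticleSoA {
    std::array<float, N> x{}, y{}, z{}, vx{}, vy{}, vz{};
};

// Same update, but each statement walks two stride-1 float arrays.
template <std::size_t N>
void step_soa(ParticleSoA<N>& ps) {
    for (std::size_t i = 0; i < N; ++i) {
        ps.x[i] += ps.vx[i];
        ps.y[i] += ps.vy[i];
        ps.z[i] += ps.vz[i];
    }
}
```

Both functions compute the same result; only the memory walk differs. The duplication between Particle and ParticleSoA is the maintenance cost in question.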
CMake recipe
add_compile_options(
    -O2 -march=native                                        # let the auto-vectoriser see the target
    $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Rpass=vectorize>  # Clang only: report vectorised loops
    $<$<CXX_COMPILER_ID:GNU>:-fopt-info-vec>                 # GCC only: same idea
)
# If using Highway, add one find_package
find_package(hwy CONFIG REQUIRED)
target_link_libraries(your_target PRIVATE hwy::hwy)
# If using ISPC, add it as a dedicated language
enable_language(ISPC)
add_library(your_kernel_isa OBJECT kernel.ispc)
Reflection today
The example below derives a SoA layout for any aggregate from nonstatic_data_members_of(^^T). One std::array<member-type, N> per member, indexed accessors via splice + tuple-get. The hot loop becomes N stride-1 sequences the auto-vectoriser turns into vector instructions. No SIMD intrinsics in user code; the layout transform is the win.
struct Particle { float x, y, z, vx, vy, vz; };

// Reflection-derived: one std::array per member of T.
// (Declaration only here; the definition is in the full source.)
template <typename T, std::size_t N> struct SoA;

SoA<Particle, 1024> soa;
for (std::size_t i = 0; i < 1024; ++i) {
    soa.at<0>(i) = float(i);  // x
    soa.at<3>(i) = 1.0f;      // vx
}

// Hot loop: integrate position. Stride-1 access; -O2 + -march=native
// auto-vectorises into 4-wide (NEON) or 8-wide (AVX2) float adds.
for (std::size_t i = 0; i < 1024; ++i) {
    soa.at<0>(i) += soa.at<3>(i);
    soa.at<1>(i) += soa.at<4>(i);
    soa.at<2>(i) += soa.at<5>(i);
}
Full source: posts/toolset/simd-in-cpp-2026/examples/reflect-soa-bench.cpp .
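For readers without a reflection-capable compiler, the shape of the generated type can be approximated in plain C++17. The sketch below is a stand-in, not the post's implementation: ManualSoA and ParticleColumns are hypothetical names, and the member types are listed by hand where the real example would take them from nonstatic_data_members_of(^^T).

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// Stand-in for the reflection-derived SoA: one std::array<Member, N>
// per member type, stored in a tuple. The Members pack is written out
// manually here; reflection would supply it from the struct itself.
template <std::size_t N, typename... Members>
struct ManualSoA {
    std::tuple<std::array<Members, N>...> columns{};

    // Column I, element i: the same call shape as soa.at<I>(i).
    template <std::size_t I>
    auto& at(std::size_t i) { return std::get<I>(columns)[i]; }
};

// Particle { float x, y, z, vx, vy, vz; } spelled out as six float columns.
template <std::size_t N>
using ParticleColumns = ManualSoA<N, float, float, float, float, float, float>;
```

With this in place, ParticleColumns<1024> soa; supports the same soa.at<0>(i) += soa.at<3>(i) loop. What reflection removes is the hand-maintained list of six float entries.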
Reproduce locally
docker run --rm -it \
-v "$PWD":/work -w /work \
ghcr.io/wrocpp/cpp-reflection:2026-05 \
bash -c 'clang++ -std=c++26 -freflection-latest -stdlib=libc++ -O2 -Wl,-rpath,/opt/p2996/clang/lib/aarch64-unknown-linux-gnu posts/toolset/simd-in-cpp-2026/examples/reflect-soa-bench.cpp -o /tmp/h && /tmp/h'
Expected output:
AoS: 479 us
SoA: 199 us
(values are -O2 noise-prone; check the asm in godbolt for
the actual SIMD width. Layout transform itself is the
reflection contribution.)
aos[0].x=1000, soa.at<0>(0)=1000
Numbers above are from aarch64; on x86-64 with AVX2 the gap can widen further (8-wide vector adds vs scalar). The point isn’t the speedup magnitude. It’s that you wrote one struct, reflection generated the SIMD-friendly layout, and the auto-vectoriser took it from there. Pair this with std::simd for explicit vectorised kernels when you need control beyond what -march=native extracts.
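As a rough illustration of an explicit kernel over one pair of SoA columns, the sketch below uses std::experimental::simd from the Parallelism TS 2 (shipped by libstdc++ as <experimental/simd>). That is a stand-in chosen because it is widely available today; the C++26 <simd> spellings differ. The helper name integrate_axis and the dt parameter are assumptions for this example.

```cpp
#include <array>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical helper: pos[i] += vel[i] * dt over two stride-1 columns.
template <std::size_t N>
void integrate_axis(std::array<float, N>& pos,
                    const std::array<float, N>& vel, float dt) {
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size();
    const V vdt(dt);  // broadcast dt to every lane

    std::size_t i = 0;
    for (; i + W <= N; i += W) {  // full vectors only
        V p(&pos[i], stdx::element_aligned);
        V v(&vel[i], stdx::element_aligned);
        p += v * vdt;
        p.copy_to(&pos[i], stdx::element_aligned);
    }
    for (; i < N; ++i) {  // scalar tail for the remaining elements
        pos[i] += vel[i] * dt;
    }
}
```

Called once per axis, for example integrate_axis(xs, vxs, dt), this keeps the vector width and the tail handling visible in the source instead of leaving both to the auto-vectoriser.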
Composable with the safety walkers
The same nonstatic_data_members_of walker that drives this SoA transform also drives the hardened-stdlib schema lint (no raw pointers / C-arrays), the qualified-compilers MISRA Rule 11.0.1 lint (members must be private), and the lifetime-safety borrow lint (view members must be annotated). One walker, four orthogonal rules. Add a fifth — “all members are arithmetic so SoA can use std::simd directly” — and you have a strict-SIMD profile that catches structurally non-vectorisable schemas at compile time.
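The proposed fifth rule reduces to a fold over member types. The sketch below assumes the member types are already available as a pack; all_members_arithmetic is a hypothetical name. In the reflection version the pack would come from the walker, whereas here it is written out by hand.

```cpp
#include <string>
#include <type_traits>

// Hypothetical predicate: true when every listed member type is arithmetic,
// i.e. every SoA column could be loaded straight into a SIMD vector.
template <typename... Members>
inline constexpr bool all_members_arithmetic =
    (std::is_arithmetic_v<Members> && ...);

// Particle { float x, y, z, vx, vy, vz; } passes the strict-SIMD rule.
static_assert(all_members_arithmetic<float, float, float, float, float, float>);

// A schema with a std::string member is rejected at compile time.
static_assert(!all_members_arithmetic<float, std::string>);
```

Because the check is a constant expression, a schema that breaks the rule fails the build rather than silently falling back to scalar code.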
Where this is heading
C++29 candidate features collapse the loop further:
Token injection (P3294) extends the reflection pattern to also inject the SIMD kernel alongside the SoA storage:
// C++29 candidate -- pseudo-syntax. As of 2026-05-15 not in any
// shipping toolchain. P3294 in WG21.
[[ inject(simd_friendly, soa, step_kernel) ]]
struct Particle {
    float x, y, z, vx, vy, vz;
};
// Compiler emits ParticleSoA + ParticleSoA::step() using std::simd
// with -march=native dispatch. Reader writes the schema; compiler
// writes the engine.
Profile enforcement (P3081 Sutter / P3589 Dos Reis / P3984 Stroustrup) lets a namespace declare it accepts only SoA-derived types:
// C++29 candidate -- pseudo-syntax.
[[ profiles::enforce(soa_only) ]]
namespace particle_sim {
    void step(SoA<Particle, 1024>& particles);
    // Inside this namespace: passing AoS storage to a vectorised
    // kernel refuses to compile.
}
The 2026 story is layout transform via reflection plus an explicit kernel via std::simd. If the C++29 candidates above land, both collapse into one declarative attribute.
Cross-links: the profiling-cpp-2026 entry covers perf / Tracy / Perfetto for measuring the speedup. Reflection-series post 25 (reflect-soa) develops the SoA pattern in depth (scheduled for 2026-06-29). The hardened-stdlib and lifetime-safety-2026 entries use the same walker pattern for orthogonal rules.
Reviewed: 2026-05-15. SoA benchmark verified on aarch64 inside cpp-reflection container. Quarterly refresh.