SIMD in C++ 2026 -- std::simd, Highway, ISPC, and reflection-derived SoA

SIMD in C++ used to mean “include the right intrinsics header for your microarchitecture and pray the next CPU still has them”. C++26 changed the picture twice: std::simd (P1928, Matthias Kretz) puts portable vector types in the standard library, and reflection makes Structure-of-Arrays a derive-from-shape transform instead of a maintenance burden. The auto-vectoriser, std::simd, Highway, and ISPC all want SoA storage; reflection eliminates the boilerplate gap between “I have a struct” and “I have a layout the vectoriser can chew”. This page is the 2026 toolkit: pick an instruction path (std::simd / Highway / ISPC / intrinsics), pair it with reflection-derived SoA, and profile the hot loop.

Today

The four SIMD paths in 2026

Path	Best for	Trade-off
`std::simd` (C++26, P1928)	Portable kernels; the default starting point	Conservative ABI; some niche ops missing
Google Highway	Runtime dispatch SSE/AVX-512/NEON/SVE/RVV	Apache-2.0 licensed; one more dependency to build and link
ISPC	SPMD-natural workloads (image, DSP, sim)	Separate compiler in your toolchain
Raw intrinsics	Last-cycle tuning; specific instructions that the abstractions hide	Per-arch; portability is on you

Why std::simd usually wins as the default

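The C++26 <simd> header gives you std::simd::vec<float>, std::simd::vec<int>, fixed-width variants like std::simd::vec<float, 8>, and the operator overloads you’d expect. (Earlier revisions of P1928 spelled the type std::simd<float>; the final wording moved everything into namespace std::simd and renamed the type to vec.) libstdc++ has shipped the Parallelism TS 2 predecessor as <experimental/simd> since GCC 11, and both libc++ and libstdc++ track the C++26 header. The Highway authors explicitly position their library as “use std::simd unless you need cross-arch runtime dispatch or SVE/RVV right now”, so the standard library is the right starting point for new code.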

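#include <ranges>
#include <simd>
#include <span>

// Spellings follow the C++26 working draft (namespace std::simd); earlier
// P1928 revisions used std::simd<float> with copy_from / copy_to.
void scale(std::span<float> data, float k) {
    using V = std::simd::vec<float>;
    // The last chunk may hold fewer than V::size() elements, so use the
    // partial load/store, which only touch the elements actually present.
    for (auto chunk : data | std::views::chunk(V::size())) {
        V v = std::simd::partial_load<V>(chunk);
        v *= k;
        std::simd::partial_store(v, chunk);
    }
}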

Highway, ISPC, and intrinsics enter when you have a specific reason: cross-arch deployment (Highway), a heavy SPMD workload (ISPC), or a single instruction the standard library doesn’t expose (intrinsics).

Layout matters more than instruction choice

Every SIMD path – standard, library, compiler-extension, intrinsic – prefers the same data shape: Structure-of-Arrays. Each field of your struct is packed contiguously across N elements, so the kernel reads stride-1 vectors instead of gathering fields out of an array of structs. Pre-2026 the SoA transform was hand-coded boilerplate: write a parallel ParticleSoA struct, keep it in sync with Particle by hand, regret it on the next member addition.

That gap is what the “Reflection today” section below closes.
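
To make the layout difference concrete, here is a minimal hand-written sketch of both shapes with the same integrate step. The names ParticleAoS, ParticlesSoA, step_aos, step_soa and kCount are hypothetical and chosen for illustration only. This is the boilerplate version that has to be kept in sync by hand.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kCount = 1024;   // hypothetical element count

// Array-of-Structures: consecutive x values sit sizeof(ParticleAoS) bytes apart.
struct ParticleAoS { float x, y, z, vx, vy, vz; };

// Hand-written Structure-of-Arrays: each field is its own contiguous array.
struct ParticlesSoA {
    std::array<float, kCount> x{}, y{}, z{}, vx{}, vy{}, vz{};
};

// AoS step: every iteration walks one whole struct.
void step_aos(std::array<ParticleAoS, kCount>& p) {
    for (auto& q : p) {
        q.x += q.vx;
        q.y += q.vy;
        q.z += q.vz;
    }
}

// SoA step: three stride-1 loops over contiguous floats.
void step_soa(ParticlesSoA& p) {
    for (std::size_t i = 0; i < kCount; ++i) p.x[i] += p.vx[i];
    for (std::size_t i = 0; i < kCount; ++i) p.y[i] += p.vy[i];
    for (std::size_t i = 0; i < kCount; ++i) p.z[i] += p.vz[i];
}
```

Adding a member to ParticleAoS means touching ParticlesSoA and both step functions as well, which is the maintenance cost described above.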

CMake recipe

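add_compile_options(
    # Scope to C++ so ISPC sources (if enabled) never see these flags.
    "$<$<COMPILE_LANGUAGE:CXX>:-O2;-march=native>"            # let the auto-vectoriser see the target
    "$<$<COMPILE_LANG_AND_ID:CXX,Clang>:-Rpass=vectorize>"    # clang only: report vectorised loops
    "$<$<COMPILE_LANG_AND_ID:CXX,GNU>:-fopt-info-vec>"        # GCC only: same idea
)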

# If using Highway, add one find_package
find_package(hwy CONFIG REQUIRED)
target_link_libraries(your_target PRIVATE hwy::hwy)

# If using ISPC, add it as a dedicated language
enable_language(ISPC)
add_library(your_kernel_isa OBJECT kernel.ispc)

Reflection today

The example below derives an SoA layout for any aggregate from nonstatic_data_members_of(^^T): one std::array<member-type, N> per member, with indexed accessors built from a splice plus a tuple get. The hot loop becomes one stride-1 sequence per member, which the auto-vectoriser turns into vector instructions. No SIMD intrinsics in user code; the layout transform is the win.

struct Particle { float x, y, z, vx, vy, vz; };

// Reflection-derived: one std::array per member of T.
template <typename T, std::size_t N> struct SoA;

SoA<Particle, 1024> soa;
for (std::size_t i = 0; i < 1024; ++i) {
    soa.at<0>(i) = float(i);     // x
    soa.at<3>(i) = 1.0f;         // vx
}

// Hot loop: integrate position. Stride-1 access; -O2 + -march=native
// auto-vectorises into 4-wide (NEON) or 8-wide (AVX2) float adds.
for (std::size_t i = 0; i < 1024; ++i) {
    soa.at<0>(i) += soa.at<3>(i);
    soa.at<1>(i) += soa.at<4>(i);
    soa.at<2>(i) += soa.at<5>(i);
}

Full source: posts/toolset/simd-in-cpp-2026/examples/reflect-soa-bench.cpp.
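
For readers without a reflection-capable compiler, the storage shape can be approximated in plain C++20 by listing the member types by hand. This is a sketch of what the derived type plausibly looks like, not the code from the full source: SoAColumns, ParticleColumns and integrate_columns are hypothetical names, and with P2996 the pack Ts... would be computed from the members of T instead of typed out.

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// One std::array<T, N> per listed member type, stored in a tuple.
template <std::size_t N, typename... Ts>
struct SoAColumns {
    std::tuple<std::array<Ts, N>...> columns{};

    // Element i of column I; I follows the order the types were listed in.
    template <std::size_t I>
    auto& at(std::size_t i) {
        return std::get<I>(columns)[i];
    }
};

// Hypothetical stand-in for SoA<Particle, 1024>: six float columns
// in declaration order x, y, z, vx, vy, vz.
using ParticleColumns =
    SoAColumns<1024, float, float, float, float, float, float>;

// Same integrate step as the excerpt: position columns 0..2 += velocity columns 3..5.
void integrate_columns(ParticleColumns& soa) {
    for (std::size_t i = 0; i < 1024; ++i) {
        soa.at<0>(i) += soa.at<3>(i);
        soa.at<1>(i) += soa.at<4>(i);
        soa.at<2>(i) += soa.at<5>(i);
    }
}
```

The accessor compiles down to a plain array index, so the loop body stays a stride-1 read-modify-write over each column.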

Tested: clang-p2996 only

Compiler Explorer: reflection-derived SoA vs AoS, 1024 particles x 1000 reps (clang-p2996 -O2). Compiler: clang_bb_p2996 · -std=c++26 -freflection-latest -stdlib=libc++

Reproduce locally

Container: cpp-reflection

AoS vs SoA hot loop, reflection-derived layout (cpp-reflection container)

docker run --rm -it \
  -v "$PWD":/work -w /work \
  ghcr.io/wrocpp/cpp-reflection:2026-05 \
  bash -c 'clang++ -std=c++26 -freflection-latest -stdlib=libc++ -O2 -Wl,-rpath,/opt/p2996/clang/lib/aarch64-unknown-linux-gnu posts/toolset/simd-in-cpp-2026/examples/reflect-soa-bench.cpp -o /tmp/h && /tmp/h'

ghcr.io/wrocpp/cpp-reflection:2026-05 -- reflection cluster

expected output

AoS: 479 us
SoA: 199 us
(values are -O2 noise-prone; check the asm in godbolt for
 the actual SIMD width. Layout transform itself is the
 reflection contribution.)
aos[0].x=1000, soa.at<0>(0)=1000

Numbers above are from aarch64; on x86-64 with AVX2 the gap should widen further (8-wide vector adds vs scalar). The point isn’t the speedup magnitude – it’s that you wrote one struct, reflection generated the SIMD-friendly layout, and the auto-vectoriser took it from there. Pair this with std::simd for explicit vectorised kernels when you need control beyond what -march=native extracts.
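
Timings like the ones above can be collected with a small wall-clock helper. The sketch below is an assumption about the shape of such a harness, not the one in the full source, and time_us is a hypothetical name.

```cpp
#include <chrono>

// Run `body` `reps` times and return the elapsed wall time in microseconds.
template <typename F>
long long time_us(int reps, F&& body) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A call such as time_us(1000, [&] { /* one integrate step */ }) wraps the hot loop. Reading a result afterwards, as the expected output does with aos[0].x, keeps the optimiser from discarding the work being timed.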

Composable with the safety walkers

The same nonstatic_data_members_of walker that drives this SoA transform also drives the hardened-stdlib schema lint (no raw pointers / C-arrays), the qualified-compilers lint for MISRA C++:2008 Rule 11-0-1 (members must be private), and the lifetime-safety borrow lint (view members must be annotated). One walker, four orthogonal rules. Add a fifth – “all members are arithmetic so SoA can use std::simd directly” – and you have a strict-SIMD profile that catches structurally non-vectorisable schemas at compile time.
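
The predicate behind that fifth rule is small enough to sketch without reflection. Here the member types are listed by hand, where a reflection walker would supply them from the struct itself. The name simd_friendly_members_v is hypothetical.

```cpp
#include <type_traits>

// True when every listed member type is arithmetic (integral or floating-point).
template <typename... Members>
inline constexpr bool simd_friendly_members_v =
    (std::is_arithmetic_v<Members> && ...);

// Member types of Particle { float x, y, z, vx, vy, vz; }: accepted.
static_assert(simd_friendly_members_v<float, float, float, float, float, float>);

// A schema carrying a pointer member is rejected at compile time.
static_assert(!simd_friendly_members_v<float, const char*>);
```

Wired into the walker, a failing static_assert would point at the schema rather than at the kernel that later fails to vectorise.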

Where this is heading

C++29 candidate features collapse the loop further:

Token injection (P3294) extends the reflection pattern to also inject the SIMD kernel alongside the SoA storage:

// C++29 candidate -- pseudo-syntax. As of 2026-05-15 not in any
// shipping toolchain. P3294 in WG21.
[[ inject(simd_friendly, soa, step_kernel) ]]
struct Particle {
    float x, y, z, vx, vy, vz;
};

// Compiler emits ParticleSoA + ParticleSoA::step() using std::simd,
// targeting whatever -march=native selects. You write the schema;
// the compiler writes the engine.

Profile enforcement (P3081 Sutter / P3589 Dos Reis / P3984 Stroustrup) lets a namespace declare it accepts only SoA-derived types:

// C++29 candidate -- pseudo-syntax.
[[ profiles::enforce(soa_only) ]]
namespace particle_sim {
    void step(SoA<Particle, 1024>& particles);
    // Inside this namespace: passing AoS storage to a vectorised
    // kernel refuses to compile.
}

The 2026 story is layout transform via reflection + explicit kernel via std::simd. C++29 collapses both into one declarative attribute.

Cross-links: the profiling-cpp-2026 entry covers perf / Tracy / Perfetto for measuring the speedup. Reflection-series post 25 (reflect-soa) develops the SoA pattern in depth (scheduled for 2026-06-29). The hardened-stdlib and lifetime-safety-2026 entries use the same walker pattern for orthogonal rules.

Reviewed: 2026-05-15. SoA benchmark verified on aarch64 inside cpp-reflection container. Quarterly refresh.