Add AVX-512 support by Shnatsel · Pull Request #231 · linebender/fearless_simd

Shnatsel · 2026-05-24T21:09:01Z

Yes, really. It's all here. In one humongous PR. Sorry 😅

This is probably best reviewed commit-by-commit. The first commit is still big because the history was getting really messy with changes and rollbacks, and squashing it made it less of a mess.

This also touches other backends in three ways:

1. set_mask() is now a backend method so it could be specialized per-level
2. Changes to mask conversion routines to support different internal representations bled into other levels. It occasionally adds an intermediate array but it gets optimized out in practice.
3. transmute_copy() is wrapped into checked_transmute_copy() and the raw version disallowed after I almost had a horrible accident with it. This could be its own PR but I wanted the insurance right away.
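The third change above can be pictured roughly as follows. This is only a minimal sketch, assuming the wrapper checks at compile time that both sizes match and requires both types to be `Copy`, as a later commit message in this PR describes. The helper `AssertSameSize` is a hypothetical name introduced for illustration, and the real function in `fearless_simd` may have a different signature.

```rust
use core::marker::PhantomData;
use core::mem::{size_of, transmute_copy};

// Hypothetical helper: evaluating `OK` fails compilation when the sizes differ.
struct AssertSameSize<Src, Dst>(PhantomData<(Src, Dst)>);

impl<Src, Dst> AssertSameSize<Src, Dst> {
    const OK: () = assert!(
        size_of::<Src>() == size_of::<Dst>(),
        "checked_transmute_copy: source and destination sizes differ"
    );
}

/// Sketch of a size-checked `transmute_copy` wrapper (an assumption, not the crate's code).
///
/// # Safety
/// Every bit pattern of `Src` must be a valid value of `Dst`.
#[inline(always)]
pub unsafe fn checked_transmute_copy<Src: Copy, Dst: Copy>(src: &Src) -> Dst {
    // Forces the compile-time size check for this exact pair of types.
    let () = AssertSameSize::<Src, Dst>::OK;
    // SAFETY: the sizes match (checked above); validity is the caller's obligation.
    unsafe { transmute_copy::<Src, Dst>(src) }
}
```

With a wrapper like this, a call such as `checked_transmute_copy::<u32, u64>(&x)` stops compiling instead of reading past the end of `x`, which is the class of accident the raw `transmute_copy()` allows.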

Everything changed here should be covered by tests. I've expanded test coverage where it was lacking.

…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
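To illustrate the difference between compact bitmask storage and lane-wise masks mentioned in the summary, here is a small scalar model. It is a sketch only: `Mask16`, `from_lanes`, `to_lanes` and `set` are hypothetical names, the type merely stands in for a 16-bit `__mmask16`-style value, and it assumes lane-wise masks use all-ones for true and zero for false. It does not use any intrinsics or any of the crate's real types.

```rust
/// Hypothetical stand-in for a compact 16-lane mask: one bit per lane.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Mask16(pub u16);

impl Mask16 {
    /// Packs a lane-wise mask (nonzero lane = true) into one bit per lane.
    pub fn from_lanes(lanes: [i32; 16]) -> Self {
        let mut bits = 0u16;
        for (i, lane) in lanes.iter().enumerate() {
            if *lane != 0 {
                bits |= 1u16 << i;
            }
        }
        Mask16(bits)
    }

    /// Expands the bitmask back into canonical lanes: -1 for true, 0 for false.
    pub fn to_lanes(self) -> [i32; 16] {
        core::array::from_fn(|i| if self.0 & (1u16 << i) != 0 { -1 } else { 0 })
    }

    /// Sets a single lane; on a compact mask this is a plain bit operation.
    pub fn set(&mut self, index: usize, value: bool) {
        assert!(index < 16, "lane index out of range");
        self.0 = (self.0 & !(1u16 << index)) | ((value as u16) << index);
    }
}
```

In this model the mask occupies two bytes rather than a full vector register, and setting one lane does not require a round trip through an array, which is one plausible reason to specialize such an operation per backend.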

…rage. Only for 8-bit left shift, LLVM autovectorizes the scalar fallback into GFNI instructions on 256-bit halves, which emits more instructions but schedules better and ends up slightly faster according to llvm-mca on Sapphire Rapids. The difference isn't huge, though, and I don't want to rely on autovectorization because of its fragility.
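For readers unfamiliar with what a scalar fallback looks like, here is a minimal sketch of a per-lane 8-bit left shift that a compiler is free to autovectorize. The function name `shl_u8x64_scalar` is hypothetical, and the choice of `wrapping_shl` (which reduces the shift amount modulo 8) is an assumption; the crate's actual fallback and its shift semantics may differ.

```rust
/// Hypothetical scalar fallback: shifts each of 64 byte lanes left by `shift`.
/// Whether this becomes SIMD code is entirely up to the optimizer.
pub fn shl_u8x64_scalar(v: [u8; 64], shift: u32) -> [u8; 64] {
    // `wrapping_shl` masks the shift amount to the lane width (assumed semantics).
    v.map(|lane| lane.wrapping_shl(shift))
}
```

Code of this shape relies on the optimizer recognizing the loop, which is the fragility the commit message refers to; a dedicated intrinsic path produces the same instructions regardless of optimizer heuristics.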

LaurenzV · 2026-05-25T06:18:15Z

I think it would indeed be great to have a separate PR for item 3.

Shnatsel · 2026-05-25T08:47:09Z

It will cause a lot of conflicts if I try to split it, but I have it isolated to its own commit at least: f08f7e6

Shnatsel · 2026-05-25T12:10:45Z

I've run miri on the entire test suite (minus the exhaustive tests that have opt-outs) and filtered out the tests that fail due to unsupported intrinsics. SSE4.2, AVX2 and AVX-512 all pass.

"we have `bytemuck` at home" These are the `transmute_copy()` safety improvements from linebender#231 taken one step further: on top of checking the sizes match at compile time and that both types are `Copy`, we also add a marker trait to types that are safe to transmute and require it in `checked_transmute_copy()`, making that function fully safe to call. The design follows the `bytemuck` crate closely. We could just use `bytemuck`, but it would pull in `syn` and other proc macro machinery, and we would still have to write the `impl_aligned_simd_pod!` macro because bytemuck rightfully refuses to derive `Pod` on generic wrappers like `Aligned128<T>` because they may introduce padding depending on the `T`. So the amount of code it saves us is minimal. Rebasing linebender#231 on top of this would be hellish, but a merge commit probably wouldn't be too bad.

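The marker-trait design described in the commit message above could look roughly like this. It is a sketch under stated assumptions: the trait name `PlainData` and the helper `AssertSameSize` are hypothetical, the set of implementing types is illustrative, and the crate's real trait, macro and bounds may differ.

```rust
use core::marker::PhantomData;
use core::mem::{size_of, transmute_copy};

/// Hypothetical marker trait for types that are safe to reinterpret bitwise.
///
/// # Safety
/// Implementors must have no padding and must accept every bit pattern.
pub unsafe trait PlainData: Copy + 'static {}

unsafe impl PlainData for u8 {}
unsafe impl PlainData for u32 {}
unsafe impl PlainData for f32 {}
// Arrays of padding-free elements are themselves padding-free.
unsafe impl<T: PlainData, const N: usize> PlainData for [T; N] {}

// Hypothetical helper: evaluating `OK` fails compilation when the sizes differ.
struct AssertSameSize<Src, Dst>(PhantomData<(Src, Dst)>);

impl<Src, Dst> AssertSameSize<Src, Dst> {
    const OK: () = assert!(size_of::<Src>() == size_of::<Dst>());
}

/// Safe to call: the trait bound rules out invalid bit patterns and padding,
/// and the size check rules out out-of-bounds reads.
pub fn checked_transmute_copy<Src: PlainData, Dst: PlainData>(src: &Src) -> Dst {
    let () = AssertSameSize::<Src, Dst>::OK;
    // SAFETY: both types are `PlainData` and have the same size.
    unsafe { transmute_copy::<Src, Dst>(src) }
}
```

The `unsafe` burden moves from every call site to the handful of `unsafe impl` lines, which is the same trade the `bytemuck` crate makes with its `Pod` trait.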
Regenerate the branch-added AVX512 module so reference casts use checked_cast_ref and checked_cast_mut. Also apply the float bit-pattern assertion style from PR linebender#235 to the branch-added f32x16 interleaved-load test. Validation: cargo test
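The commit message does not spell out what the float bit-pattern assertion style is. One plausible reading, shown here purely as an assumption, is that tests compare the raw bits of each lane rather than the float values, so that NaN payloads and the sign of zero are checked exactly. The helper name `assert_same_bits` is hypothetical.

```rust
/// Hypothetical test helper: compares two float slices lane by lane on raw bits.
/// Unlike `==` on floats, this treats NaN as equal to an identical NaN and
/// distinguishes -0.0 from 0.0.
pub fn assert_same_bits(actual: &[f32], expected: &[f32]) {
    assert_eq!(actual.len(), expected.len(), "lane count differs");
    for (i, (a, e)) in actual.iter().zip(expected).enumerate() {
        assert_eq!(a.to_bits(), e.to_bits(), "lane {i} differs: {a:?} vs {e:?}");
    }
}
```

A plain `assert_eq!` on `f32` values would fail on any lane holding NaN and would accept `-0.0` where `0.0` was expected, so a bitwise comparison is the stricter check for load and shuffle tests that should move data unchanged.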

PR linebender#237 only updates NEON load construction. The AVX512 branch-specific unsafe load sites were already adapted in the PR linebender#233 follow-up, and a search found no remaining load intrinsics needing the linebender#237 pattern.

Shnatsel added 20 commits May 24, 2026 18:24

Add checked_transmute_copy and ban transmute_copy to statically prevent the spooky bug I almost introduced

f08f7e6

Expand native type conversion test coverage

aef1cac

Rename test: mask_methods.rs -> mask_roundtrip.rs

c12a7cc

Check in the new generated AVX-512 file

9d9adf8

Fix build after file rename

81441cf

Use AVX-512 instructions for f32 -> u32 conversions. Expand test coverage for these ops.

0d6af5d

Optimize load_array/as_array on AVX-512 masks; the initial impl was scalar, now we use the dedicated intrinsics.

025c172

Split set_mask into a backend method so it could be specialized per backend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.

7927383

Optimize load_interleaved/store_interleaved for AVX-512. Add one more test to exercise it. i8/u8 test is still bad because of rust-lang/rust#156891

57de129

Optimize floor/ceil/round_ties_even/trunc/approximate_recip for 512-bit vectors on AVX-512; expand test coverage

f2ba8c9

Use AVX-512 rcp14 for smaller vector sizes too; improves precision at no cost to throughput

9cddbb2

Optimize slide_within_blocks for AVX-512; verified with exhaustive slide test

9d02c3a

Remove stale tests for mask slide APIs; they were under #[cfg(false)] so they didn't show up earlier when I removed those methods.

85b44c9

consistent clippy error messages

1c558ca

satisfy Clippy

6c8f7d7

get rid of useless extra braces

e475ae1

KISS the native type mask roundtrip tests

6f1081f

cargo fmt

1e2a096

Shnatsel mentioned this pull request May 24, 2026

Initial AVX-512 support #201

Closed

Shnatsel added 6 commits May 24, 2026 22:15

Satisfy clippy some more. Hoisted by my own restriction lint.

7fc16d4

Satisfy the toml formatting check

359650d

Stick an #[expect] onto checked_transmute_copy on wasm32, otherwise we get dead code warnings

37df3e3

Suppress an apparently buggy Clippy lint; surfaced only in `cargo clippy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there

8825bfb

Satisfy the toml formatter again

cf3ff7d

Add miri opt-outs for extra slow tests

cb5780f

Shnatsel mentioned this pull request May 24, 2026

Set up Miri tests in CI #173

Open

Also enforce that both types are Copy in checked_transmute_copy. We can't enforce Pod without an external dependency.

f55271b

Shnatsel mentioned this pull request May 25, 2026

Safe transmute_copy() #232

Merged

Merge branch 'main' into avx512-yes-really

cd8192c

# Conflicts:
# fearless_simd/src/generated/avx2.rs
# fearless_simd/src/generated/neon.rs
# fearless_simd/src/generated/sse4_2.rs
# fearless_simd/src/generated/wasm.rs
# fearless_simd_gen/src/generic.rs
# fearless_simd_gen/src/level.rs

Shnatsel added 12 commits May 25, 2026 16:32

Fix disallowed methods setup that got mangled in the merge

15f5ab8

Drop a custom transmute_copy wrapper from tests now that it has the same name but different semantics from the production code to avoid confusion

6233743

Optimize min_precise/max_precise for AVX-512, expand test coverage. AVX-512 has configurable comparison modes that we can use to implement the advertised _precise semantics.

88bc247

Expand interleave/deinterleave test coverage

608b53f

Merge main PR linebender#233: remove unsafe loads

d45b511

Merged origin/main commit 13dd530. The merge applied cleanly.

Apply PR linebender#233 load safety pattern to AVX512

b03927f

Replace AVX512 interleaved load intrinsics emitted by the branch with checked_transmute_copy, then regenerate the generated AVX512 module.

Merge main PR linebender#234: replace by-value transmutes

fa81bb8

Merged origin/main commit fbc97da. The merge applied cleanly.

Apply PR linebender#234 transmute pattern to AVX512

b5de7ff

Regenerate the branch-added AVX512 module so by-value transmutes use checked_transmute_copy, matching PR linebender#234. Validation: cargo test

Merge main PR linebender#235: implement safe reference casts

0c9535b

Merged origin/main commit 0d13b0a. The merge applied cleanly.

Merge main PR linebender#237: replace NEON unsafe loads

a593499

Merged origin/main commit 650815d. The merge applied cleanly.

Shnatsel wants to merge 40 commits into linebender:main from Shnatsel:avx512-yes-really

2 participants
