Add SQ8↔FP16 x86 SIMD distance kernels [MOD-14954]#970
Conversation
Captures the architecture, file-level plan, CMake F16C gating, and risk register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them into the existing dispatcher tables and SQ8_FP16 unit/benchmark scaffolding from MOD-15141. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so
emitted code for those kernels is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest; currently asserts only the scalar fallback path. Per-tier SIMD assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside the kernel implementations in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product, Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8 storage and FP16 query metadata reads use load_unaligned to tolerate odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI when features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni; otherwise behaviour is unchanged from MOD-15141. A parameterised gtest fixture exercises every residual class in [16, 32] against the scalar baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware without FMA support (uncommon but matches the existing SQ8_FP32 tiering). Dispatcher gate requires features.avx2 && features.f16c and runs between the AVX2+FMA and SSE4 tiers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires features.sse4_1 && features.f16c && features.avx since F16C is VEX-encoded — matches the existing F16C/FP16 dispatcher gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA / AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set (only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts to compute the expected function for the host's highest supported tier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier on the corresponding OPT_* defines plus the runtime feature checks that mirror the dispatcher in IP_space.cpp / L2_space.cpp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a43eac4 to
e21cb3b
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #970 +/- ##
==========================================
+ Coverage 96.99% 97.05% +0.05%
==========================================
Files 130 141 +11
Lines 7793 8105 +312
==========================================
+ Hits 7559 7866 +307
- Misses 234 239 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)`
algebraic-identity comment header that AVX-512 already carried, plus a
precondition note that callers must enforce dim >= 16 (matches the
established SQ8_FP32 convention; no runtime assert because sibling
SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
so the runtime-selected SIMD tier is actually exercised on those inputs,
not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
512, 1024} so multi-iteration do-while loop bugs would fire (the existing
[16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
single test that emits per-tier coverage to stderr and GTEST_SKIPs when
no SIMD tier is available, so CI runners without AVX-512 do not silently
report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
("scalar leaves caller's value untouched") instead of just measuring
that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.
Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels by widening the inner loop to use multiple independent accumulators. L2 kernels inherit the change through their `…InnerProductImp_…` call. - IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks. - IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the existing 2-step unrolled body now routes each step to an independent accumulator. The `residual >= 8` half-chunk feeds the second accumulator so the prologue also breaks the dependency chain. - IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate between sum_a and sum_b for prologue ILP. Correctness invariant: residual block consumes exactly `residual` lanes (0..15) → remaining tail is always a multiple of 16, so the unrolled loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131 SQ8_FP16 unit tests + 115 under ASan.
The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.
The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.
The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.
Also: fix a strict-aliasing-class UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of
the SQ8 storage object when fewer than 64 lane bytes remain, which is UB
in C++ regardless of dereference. Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).
Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F
Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
Design doc was added in ad941b8 for planning; not appropriate as a long-lived in-repo artifact. Keep externally (Confluence / scratch) rather than ship with the kernel commit.
Two trims, both restoring pre-existing patterns elsewhere in the file:
1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
sister tests at lines 540-548 and 551-559.
2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
`MixedSignQueryTest`) were routed through
`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
dispatch on adversarial inputs. Reverted to direct scalar calls
(`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.
Coverage rationale: the SIMD kernels are branchless on input values
(verified by grep — no value-dependent `if` in any tier). Every code
path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
random inputs at multiple dims. The edge-case tests verify the
*algebraic identity* (IP of zero query = 1.0, constant storage matches
dequant baseline, mixed-sign handling) — scalar correctness on these
inputs is what was actually being checked, and the SIMD path matches
scalar via the SpacesOptimizationTest tier walk.
Net: 77 lines removed from the test file, matches sister conventions,
no coverage gap.
The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in those dispatcher TUs is F16C-clean. The previous arrangement mixed both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h files. Split each tier's F16C-dependent kernels off into a sibling TU: functions/SSE4.cpp → SSE4 + SQ8_FP32 (no F16C) functions/SSE4_F16C.cpp → SQ8_FP16 only (requires -mavx -mf16c) functions/AVX2.cpp → AVX2 + BF16 + SQ8_FP32 (no F16C) functions/AVX2_F16C.cpp → SQ8_FP16 only (requires -mf16c) functions/AVX2_FMA.cpp → SQ8_FP32 (no F16C) functions/AVX2_FMA_F16C.cpp → SQ8_FP16 only (requires -mf16c) The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses `_mm512_cvtph_ps`, which is part of AVX-512F and not F16C. CMake now compiles each sibling TU conditionally on `_has_full_f16c` and applies the F16C flags only there. Base TUs no longer carry `-mf16c`, since they no longer reference F16C intrinsics. Result: - No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`. - Compile-time isolation: an F16C intrinsic accidentally added outside a `_F16C` sibling will fail to build, not silently miscompile. - Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`, `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new sibling .h includes are unconditional, since declarations alone don't link-error and the calls remain guarded. Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm no regression from the dispatcher restructure).
…[MOD-14954]
Two related cleanups in the SQ8↔FP16 dispatch path:
1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
(`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni).
That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake
variants, etc.) would fall through to AVX2_FMA even though they can
run the AVX-512 kernel.
Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
`functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
sites now gate on `OPT_AVX512F` + `features.avx512f`.
2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16
tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query
lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted
into a single outer block around the three ISA branches in
`IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
`L2_SQ8_FP16_GetDistFunc`.
Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 839fe3c. Configure here.
| any_simd = true; | ||
| } else { | ||
| std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n"; | ||
| } |
There was a problem hiding this comment.
Tier coverage test checks wrong AVX-512 preprocessor guard
Low Severity
The ReportTiersExercised test gates the AVX-512 check on OPT_AVX512_F_BW_VL_VNNI, but the SQ8_FP16 kernels are dispatched under OPT_AVX512F (as seen in IP_space.cpp and L2_space.cpp and in the SQ8_FP16_SpacesOptimizationTest at line 3097). On CPUs with AVX-512F but without BW+VL+VNNI, the SIMD optimization tests correctly exercise the AVX-512F tier, but this coverage reporter incorrectly omits it — potentially causing any_simd to remain false and the test to spuriously skip.
Reviewed by Cursor Bugbot for commit 839fe3c. Configure here.
The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.


Summary
IP_space.cpp/L2_space.cpp) with F16C gating on the AVX2/SSE4 tiers; AVX-512 path uses_mm512_cvtph_pswhich is part of AVX512F (no F16C requirement).-mf16cconditionally to AVX2_FMA / AVX2 / SSE4 dispatcher source files in CMake (purely additive — no SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 codegen change since those sources contain no F16C intrinsics).SQ8_FP16_SpacesOptimizationTestto walk all four ISA tiers against the scalar baseline across dim ∈ [16, 32]; update existingGetDistFunc_*_SQ8_FP16assertions accordingly.bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout.Spec:
docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.mdPlan:
docs/superpowers/plans/2026-05-26-sq8-fp16-x86-kernels.mdTest plan
🤖 Generated with Claude Code
Note
Medium Risk
Changes hot-path vector distance math and runtime ISA selection; incorrect dispatch or SIMD numerics could affect search quality, though tier tests assert ~0.01f agreement with scalar and builds gate intrinsics by CPU flags.
Overview
Adds x86 SIMD for asymmetric SQ8 (index) ↔ FP16 (query) distances: inner product, cosine, and L2², with scalar fallback unchanged for
dim < 16or missing ISA features.Build: F16C-dependent code moves into separate objects (
AVX2_F16C,AVX2_FMA_F16C,SSE4_F16C) compiled with-mf16c(and-mavxfor SSE4) only when the toolchain reports F16C+FMA+AVX; baseAVX2/AVX2_FMA/SSE4TUs stay F16C-free. AVX-512 SQ8↔FP16 kernels live in the existingAVX512FTU (uses_mm512_cvtph_ps, no separate F16C flag).Runtime:
IP_SQ8_FP16_GetDistFunc,Cosine_SQ8_FP16_GetDistFunc, andL2_SQ8_FP16_GetDistFuncselect AVX-512F → AVX2+FMA+F16C → AVX2+F16C → SSE4.1+F16C+AVX, set SQ8 alignment hints whendimis a multiple of the chunk size, and otherwise return the scalar implementations.Validation: Parameterized tests compare each tier to the scalar baseline; benchmarks register the same ISA ladder beside the naive kernels.
Reviewed by Cursor Bugbot for commit 3565985. Bugbot is set up for automated code reviews on this repo. Configure here.